SAMPLING IN EDUCATIONAL RESEARCH¹


E. F. LINDQUIST 
College of Education, State University of Iowa 


The purposes of this paper are to elucidate certain fundamental
principles of sampling that have been generally neglected in educational
research, to show how serious have been the consequences of this
neglect, and to offer some general suggestions for the improvement of
our sampling procedures. I should like to begin by discussing very
briefly two of the most important of the standards by which a sampling
procedure may be judged.

One of the principal criteria of a good sample may be described as
its efficiency, if I may use this term in a broader and less exact sense
than that in which it is used by R. A. Fisher. The purpose of any
sample may be defined as that of yielding an estimate of some characteristic,
such as the mean, of the population from which it is drawn.
Other things being equal, the value or adequacy of the sample is
measured by the dependability of this estimate, but by this I do not
mean to imply that the sample must yield a highly dependable estimate
in order to be considered a satisfactory sample. How dependable the
estimate need be depends upon the broader purposes of the investigation.
For some purposes, very crude or approximate estimates are all
that are needed. For others, very high reliability is necessary.
Regardless, however, of the degree of reliability needed, it is obviously
desirable that this degree be secured at a minimum cost. In other
words, there is never any excuse for using an undependable estimate if
a better one can be as cheaply secured, or for spending more time and
energy than is necessary to secure the desired reliability. One of the
most important criteria of a good sampling procedure, then, is that
implied in the question: “Has the sample been so selected as to yield
the maximum dependability in the estimate, per unit of time and effort
expended in securing it?”

¹ Paper read at the annual meeting of the American Educational Research
Association in St. Louis, Feb. 25, 1940.

It is obvious that this question cannot be answered unless we have 
some way of measuring the reliability of the estimate obtained. The 
second criterion, then, is concerned, not with whether the true relia- 
bility is high or low, but with the validity with which that reliability is 
measured or described. The importance of this criterion should be 
readily apparent; yet it has been so far neglected in educational 
research that many studies have contained no description whatever of 
the reliability of the findings. There seems to be need, then, to stress 
even so elementary a truth as that, unless we have some notion of how 
reliable an estimate is, we might as well have no estimate at all. An 
estimate whose reliability is unknown can have no scientific value 
whatever, no matter how reliable it may be in actuality. To admit 
that we do not know how much confidence may be placed in an infer- 
ence based on a sample is to admit that it may deserve no confidence at 
all. The value of an estimate, then, depends not only upon how 
reliable that estimate actually is, but also upon how certainly we know 
what degree of reliability it does possess. Of the two, the latter con- 
sideration is probably of the greater importance. In other words, if 
we have available two estimates, for only one of which we have an 
objective and dependable measure of its reliability, we might do better
to use that estimate even though we believe but cannot demonstrate 
that the other is really the more reliable. 

It should be made clear that this does not imply that we must in all 
instances have an objective measure of reliability, or that to be of any 
value every estimate must be accompanied by its standard error. On 
the contrary, we may in some instances have learned by experience that 
samples drawn in a certain way from certain types of populations have 
rarely been seriously misrepresentative, or we may know that our 
sample is like the population in certain characteristics known to be 
related to the trait in which we are interested, and hence may have 
reason to believe that it will be like the population in that trait also. 
In such instances, we may be able to place “considerable” confidence
in the inferences based upon our samples, even though we cannot 
describe the degree of reliability in exact terms, that is, even though our 
description of reliability is both subjective and indefinite. Clearly, 
however, an objective description of reliability is much to be preferred. 
It is surely much better, for example, to be able to state with a definable 
and known degree of confidence that the mean of a population lies 
between certain definite limits than only to be able to say vaguely that
we believe our estimate of the mean is “sufficiently” reliable for our
purposes.
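
As a minimal sketch of what it means to state, with a known degree of confidence, that a population mean lies between definite limits, the following Python fragment computes 95 per cent limits for the mean of a small simple random sample. The scores are invented for illustration, and the tabled value of Student's t for 9 degrees of freedom is supplied as a constant; neither comes from the paper.

```python
import math
import statistics

# Hypothetical test scores from a simple random sample of ten pupils.
scores = [52, 47, 61, 58, 49, 55, 63, 50, 57, 54]

n = len(scores)
mean = statistics.mean(scores)
# Standard error of the mean: sample SD (n - 1 divisor) over sqrt(n).
std_error = statistics.stdev(scores) / math.sqrt(n)

# Two-sided 95 per cent point of Student's t for n - 1 = 9 degrees of
# freedom, taken from standard tables.
T_CRIT_9_DF = 2.262

lower = mean - T_CRIT_9_DF * std_error
upper = mean + T_CRIT_9_DF * std_error
print(f"mean = {mean:.1f}, 95% limits = ({lower:.1f}, {upper:.1f})")
```

The limits are only as trustworthy as the assumption of simple random sampling on which the standard error rests.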

There are, then, two questions which we need most of all to raise in 
evaluating the procedure used in a sampling study. The first is,
“How efficient is the sampling, or how dependable is the estimate
obtained in relation to the time and effort expended in securing it?” and
the second is, “How valid and objective is the description provided of
the reliability of the estimate?”

It should be clear that these criteria are concerned only with the 
techniques employed in the selection of the sample and in the error 
analysis, and that there are many other important criteria which would 
have to be considered in the general or over-all evaluation of a sampling 
study. It is particularly important not to confuse the idea of the 
efficiency of a sample, as here defined, with its adequacy in relation to a 
given purpose. For example, if two achievement examinations were 
closely alike in reliability, it would ordinarily be futile to attempt to 
determine which is the more reliable by a comparison of reliability 
coefficients obtained from a sample of one hundred cases. Neverthe- 
less, this sample might be relatively very efficient, that is, it might yield 
as much information about the reliability of each of these tests as we 
could hope to secure from any sample of this size. 
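
The point may be illustrated by a rough calculation. The sketch below compares two hypothetical reliability coefficients, each from one hundred cases, by means of Fisher's z transformation. The coefficients are invented, and the two values are treated as if they came from independent samples, which is a simplifying assumption and not the situation of a single sample taking both tests.

```python
import math

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return math.atanh(r)

# Hypothetical reliability coefficients of two tests, each computed
# from 100 cases (illustrative values, not from the paper).
r_a, r_b, n = 0.90, 0.92, 100

# The standard error of each z is 1 / sqrt(n - 3); for the difference of
# two independent z's the variances add. Independence is assumed here.
se_diff = math.sqrt(2.0 / (n - 3))
ratio = (fisher_z(r_b) - fisher_z(r_a)) / se_diff
print(f"difference / standard error = {ratio:.2f}")
```

Under these assumptions the difference is smaller than its own standard error, so a sample of this size could not settle which test is the more reliable, however efficiently it was drawn.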

Let us now consider the adequacy, as judged by these criteria, of the 
sampling techniques that have been generally employed in educational 
research. We may note, first of all, that in educational research we 
have thus far relied almost exclusively for the measurement of sampling 
error upon the standard error formulas which were designed for simple 
random samples of considerable size. (In some instances, in experi- 
mental studies involving comparisons between matched groups, we 
have tried to allow for the correlation between the measures compared, 
but in almost no instances have we recognized more than one restriction 
upon the simple random selection of individual cases.) However, not 
only have practically all of our samples not been simple random sam- 
ples, but the populations involved have been such that it is almost 
impossible to draw simple random samples from them. Most of the 
populations in which we are interested in educational research are, for 
obvious reasons, populations of school children. These populations 
almost invariably consist of readily identifiable subgroups, such as the 
pupils in different schools or in different classes and under different 
teachers. Now a simple random sample, when drawn from a population
of this kind, may be defined as one which is equally likely to consist
of any one of the possible combinations of members from the subgroups 
of the population. A true random sample of one hundred cases, for 
instance, might consist of pupils distributed among almost one hundred 
different school systems, with many systems providing no more than 
one pupil each. Such sampling is usually utterly impracticable; first, 
because it requires that the entire population be catalogued; second, 
because many of the subgroups, whether for political or other rea- 
sons, are inaccessible to us; and, third, because even those that are 
accessible must usually be treated as indivisible units—that is, while we 
may be permitted to use an entire class of pupils, we are rarely per- 
mitted to break up these classes and to measure, observe, or experiment 
with only a fraction of each class. 
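
How widely a true random sample scatters over the subgroups can be seen from a small simulation. The sketch below assumes a hypothetical population of 1,000 schools with 50 pupils each; these figures, and the fixed seed, are chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1,000 schools of 50 pupils each.
N_SCHOOLS, PUPILS_PER_SCHOOL = 1000, 50
population = [(school, pupil)
              for school in range(N_SCHOOLS)
              for pupil in range(PUPILS_PER_SCHOOL)]

# Simple random sample of 100 pupils: every combination is equally likely.
sample = random.sample(population, 100)
schools_hit = {school for school, _ in sample}
print(f"{len(sample)} pupils drawn from {len(schools_hit)} different schools")
```

Note that the sketch itself requires the whole population to be listed before a single pupil can be drawn, which is the first of the practical objections raised above.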

While these restrictions upon random sampling have sometimes 
been recognized in educational research, most research workers have 
apparently considered them either as of minor practical consequence 
or as something about which nothing could be done. At any rate, we 
have gone on using the familiar standard error formulas, in which N 
is the number of pupils, and have attempted to justify this practice, if 
we have recognized the need for any justification, by contending that 
our samples have been nearly enough equivalent to random samples so 
that the formulas will yield useful approximations. It is just here, 
however, that the real fallacy lies. This contention would be valid 
only if all of the subgroups were closely alike in whatever trait was 
being investigated. If all schools, for example, were alike in level and 
spread of educational achievement, then it would matter very little in a 
study of achievement how many schools were represented in a sample. 
In this case, since variability of pupil achievement within schools 
would be the only source of sampling error, a sample consisting of 
pupils only from a single school would be just as good as one of equal 
size in which a great many schools were represented. Actually, how- 
ever, pupils in different schools or in different classes do differ very 
markedly in nearly all of the traits with which research studies are 
concerned. It is well known, for example, that the differences in 
average educational achievement from school to school are typically 
almost as large as differences in individual pupil achievement within a 
single school. Not only do the pupils in different schools differ in such 
things as past achievement and in general learning ability, but they 
differ also in their relative ability to profit by different types of instruc- 
tion—that is, due to differences in the previous educational experiences 
of the pupils, the same method of instruction may actually be a good
method in one school and a poor method in another. Differences 
between schools and between classes are, therefore, a major source of 
variation in samples drawn from these populations, but the familiar 
standard error formulas, as we have been using them, are not intended 
to take these sources of variation into consideration at all. The result 
is that most of the standard errors we have computed have been very 
seriously biased. In general, the computed standard errors have been 
very much too small, leading us to be overly confident of our conclu- 
sions, but in some cases the results obtained have been more reliable 
than we have considered them to be. The latter has been particularly
true in experimental studies involving several schools, in which the 
experimental design has been such as to control or equalize the school 
differences, but in which the error analysis has not allowed for the 
effects of these controls upon the reliability of the results obtained. 
The appropriate method of error analysis with samples of this type 
cannot be considered in any detail in a paper of this length, but the 
essential features of the methods can be very simply described in 
general terms. The major requirement in sampling from populations 
consisting of intact and relatively homogeneous subgroups is that the 
subgroup, and not the individual, be considered as the true unit of 
sampling. This means that in most studies concerned with school 
populations the effective size of the sample depends upon the number of 
schools involved, and not upon the number of pupils. For instance, a 
sample consisting of four hundred pupils taken from five randomly 
selected schools must be considered as a sample of five cases, rather 
than as a sample of four hundred. The general mean of the sample 
must be considered as the weighted mean of the five school means 
rather than as the mean of four hundred measures, and the reliability 
of this mean depends upon the variability of these five school means 
rather than upon the variability of the four hundred pupil scores.¹

¹ How the mean of a sample consisting of a given number of randomly selected
schools would compare in reliability with that of a simple random sample of the
same number of pupils would depend upon the relative variability of school means
and of pupil scores. On available standardized tests of educational achievement
the variability (SD) of school means (at the junior and senior high school levels)
is typically about half as large as the variability of pupil scores within schools.
In studies of school achievement, therefore, it follows that the mean of a sample
consisting of a given number of randomly selected schools would have about the
same reliability as the mean of a simple random sample of four times as many
pupils. For example, a mean achievement score based on the ninth-grade pupils
in five randomly selected schools (regardless of the number of pupils involved)
would ordinarily be no more reliable than the mean of a sample of twenty pupils
selected strictly at random from the same ninth-grade population.

The second general requirement in sampling from populations of
this type is that the error analysis be based upon what is known as
“exact” or small sample theory. This requirement is, of course, a
consequence of the fact that, when measured in terms of the school as a
unit, most of our samples are indeed small, frequently consisting of fewer
than five or six cases. There are various ways of analyzing the results
of a sampling study so as to observe these two requirements, but
perhaps the most convenient and most generally applicable is the
method of analysis of variance as developed by R. A. Fisher.

It might be worth while, at this point, to consider one or two
typical illustrations of the practical consequences of this fallacy of
considering the pupil as the unit and of using large sample theory when
dealing with school populations. One of the most important uses of
sampling in education has been in the establishment of norms on tests of
educational achievement. In general, we have judged the dependability
of these norms by the number of pupils involved. In fact, this
has usually been the only information concerning the dependability of
the norms that has been provided by the test publishers. To take just
one example from this field, one of the most widely used of all standardized
tests has for years carried norms based upon the administration of
the test in just twenty-four school systems. The number of pupils
involved was quite large, about ninety-eight hundred, but over four
thousand of these pupils came from just one school system and over
three thousand of the remainder from just six school systems, that is,
just seven schools accounted for over seven thousand of the ninety-eight
hundred cases used. It is possible, of course, that these twenty-four
schools were more representative of nation-wide achievement than
if they had been selected strictly at random, but this is at least open to
question, particularly since seven of these twenty-four schools were
located in the same state. At any rate, the sample must be considered
as consisting of twenty-four cases rather than of ten thousand cases,
and hence as much less dependable statistically than we have generally
believed it to be. Incidentally, it could be readily demonstrated that
by using available methods of stratified sampling one could establish a
more reliable norm on the basis of just one or two hundred pupils than
one could on the basis of ten thousand pupils taken from only twenty-four
randomly selected schools. Let me add, in case you recognize the
test to which I have referred, that the norms for this test are probably
far better than those for the great majority of all standardized tests,
and that if those norms had not by good fortune turned out to be
reasonably satisfactory, that fact would undoubtedly have been dis- 
covered and rectified long since. The important consideration, of 
course, is that for any test the number of pupils on which any norm is 
based is of relatively minor significance and that the reliability of the 
norm depends primarily upon the number of schools involved and upon 
how they are selected. It is not surprising, in consideration of the 
general neglect of this principle, that instances have been reported in 
which the same group of pupils were, on the average, at the fourth- 
grade norm on one test and at the eighth-grade norm on another test 
in the same subject. 
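
The principle that the school, and not the pupil, is the unit of sampling can be sketched numerically. In the Python fragment below the scores are invented and the schools are of equal size, so that the weighted mean of school means equals their simple mean; the helper `equivalent_pupils` is a hypothetical name for the rough rule given in the footnote on the relative variability of school means and pupil scores.

```python
import math
import statistics

# Hypothetical scores for pupils in five randomly selected schools.
schools = {
    "A": [48, 52, 50, 47, 53],
    "B": [61, 58, 63, 60, 59],
    "C": [40, 44, 41, 43, 42],
    "D": [55, 57, 54, 56, 58],
    "E": [49, 51, 50, 52, 48],
}

# Fallacious analysis: treat all pupils as one simple random sample.
all_scores = [s for scores in schools.values() for s in scores]
naive_se = statistics.stdev(all_scores) / math.sqrt(len(all_scores))

# School as the unit: the sample consists of five school means.
school_means = [statistics.mean(scores) for scores in schools.values()]
school_se = statistics.stdev(school_means) / math.sqrt(len(school_means))

def equivalent_pupils(n_schools, sd_school_means, sd_pupils):
    """Size of the simple random sample of pupils whose mean is about as
    reliable as the mean of n_schools randomly selected schools."""
    return n_schools * (sd_pupils / sd_school_means) ** 2

print(f"naive SE (N = {len(all_scores)} pupils): {naive_se:.2f}")
print(f"school-unit SE (N = {len(school_means)} schools): {school_se:.2f}")
print(equivalent_pupils(5, 0.5, 1.0))  # SD of school means half that of pupils
```

With these invented figures the pupil-based standard error is well under half the school-based one, which is the direction of bias described in the text.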

Another illustration typical of the fallacious error analyses in many 
experimental studies in education may be found in an investigation 
which received considerable attention about a year ago. The purpose
of this investigation was to compare educational achievement and 
other traits of pupils in so-called Progressive and so-called traditional 
elementary schools. Most of the comparisons were based upon sam- 
ples of about three hundred pupils from each population, but only six 
schools were represented in each sample. The differences in mean 
scores on the various tests were evaluated in the study by using the 
familiar formula for the standard error of the difference in means of 
simple random samples, the pupil being considered as the unit. Quite 
aside from the fact that the school and not the pupil was the true unit 
of sampling, the assumption of random sampling was clearly unjustified 
in this case, since a deliberate attempt had been made to control 
chance factors by selecting pairs of schools from like communities and 
by matching the pupils with respect to intelligence and other factors 
within each pair of schools. Incidentally, such purposive or deliberate 
selection of schools always introduces the danger of serious bias, for 
which no allowance can be made by mathematical formulas. Accord- 
ing to the analysis made in the study, nearly all of the differences 
appeared to be highly significant, and were so reported. However, if 
more valid methods of analysis are applied to the same results—that is, 
if the school is considered as the unit and allowance is made for the 
correlation between the means of the paired schools—it may be readily 
shown that most of the differences reported could easily be due to
chance alone, even though one assumes that the purposive selection of 
schools was free from bias. On a test of “Current Beliefs,” for 
example, which I believe is fairly typical of the others, the difference in
means was reported as three and one-half times its standard error, and 
hence as highly significant. A more valid analysis revealed that this 
difference was no larger than would have resulted by chance alone in at 
least one in every five samples of this size, even though the samples 
were drawn from identical populations. Again I should say that this 
study was perhaps no more faulty in the validity of the analysis 
employed than the great majority of experimental studies which have 
been reported in the literature of educational research. On the 
contrary, this was one of the studies singled out by the American 
Educational Research Association at its meetings a year ago for special 
recognition of the quality of the techniques employed. 
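
The more valid analysis described above, with the school as the unit and allowance for the pairing, amounts to a test on the differences between paired school means. The following sketch uses six invented pairs of means; the data of the study itself are not reproduced here, and the tabled value of Student's t for 5 degrees of freedom is supplied as a constant.

```python
import math
import statistics

# Hypothetical mean scores for six matched pairs of schools
# (progressive, traditional); the values are invented.
pairs = [(54.0, 50.0), (47.0, 49.0), (60.0, 52.0),
         (51.0, 53.0), (58.0, 51.0), (49.0, 50.0)]

# One difference per pair: pairing removes community-to-community variation.
diffs = [p - t for p, t in pairs]
k = len(diffs)
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(k)
t_stat = mean_diff / se_diff

# Two-sided 5 per cent point of Student's t for k - 1 = 5 degrees of freedom.
T_CRIT_5_DF = 2.571
significant = abs(t_stat) > T_CRIT_5_DF
print(f"t = {t_stat:.2f} on {k - 1} d.f.; significant: {significant}")
```

However many pupils stand behind each mean, the test has only five degrees of freedom, which is why small sample theory is required.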

Thus far this discussion has been concerned directly only with the 
second of the criteria mentioned earlier, that is, with the validity of the 
measures of sampling error which have been generally provided in 
educational research. While it is in this respect, perhaps, that our 
techniques have been most seriously at fault, there have undoubtedly 
been many instances, also, in which the efficiency of our samples could 
have been readily improved. 

Sampling studies in education may in general be roughly classified 
as experimental and non-experimental, or as concerned either with 
experimental or with observational data. In experimental studies the 
problem of securing increased efficiency is essentially that of improving 
the design of our experiments and of making more effective use of 
statistical controls. The questions of experimental design and statisti- 
cal control are too involved to permit any useful discussion of them in a 
paper of this length.¹ Accordingly, I shall confine my discussion of
the problem of securing increased efficiency to studies of the non- 
experimental type. 

This category includes the very large class of studies in education 
which have been known as “survey” or “status” studies, such as 
studies of salaries, tenure, and professional training of teachers, studies 
of administrative practices in school systems, and studies of adult 
interests, activities, and needs. In this broad field of research, one of 
the most efficient types of sampling is that which has been variously 
known as “stratified” sampling, “controlled” sampling, or “sampling
by subgroups.” While samples of this general type have already been
quite widely used in educational research, there does seem to be some
need, in general, for a better understanding of the underlying theory,
and, in particular, for a wider knowledge of the appropriate methods of
error analysis. I should like, therefore, to give particular attention to
this latter problem.

¹ For an extended discussion of these problems, see Lindquist, E. F.: Statistical
Analysis in Educational Research. Houghton Mifflin Co., 1940.

As far as I am able to tell, there is not as yet any generally agreed
upon or standard definition of a stratified sample. For the purposes of
this discussion, however, we may define a stratified sample as one which
consists of subgroups (or of subsamples) each of which has been drawn
at random from a corresponding subgroup of the population (or from a 
corresponding subpopulation). I have selected this definition because, 
as far as I know, it is only for stratified samples so defined that we now 
have available any objective methods of error analysis. The impor- 
tant considerations in this type of sampling may perhaps best be 
explained in terms of a concrete example. The example which I have 
selected is hypothetical, and is somewhat artificial in that it avoids 
certain of the complexities which usually characterize actual studies of 
this type; but for just these reasons it may serve better as a basis for 
elucidating the basic principles involved than any actual example that 
is available. 

Suppose, then, that for some reason we wish to estimate the mean 
score that would be made on a certain test of contemporary affairs by 
all resident students in a certain large university, that our facilities will 
not permit us to test more than four or five hundred students, that we 
wish to select these cases so that they will yield the maximum information
about the entire population, and that we therefore decide to
employ a stratified sample. The first step in drawing a stratified 
sample from this population would be to divide it into subpopulations 
from each of which a subsample may be drawn at random. In this 
case, as in any other case of stratified sampling, there are many bases 
upon which this subdivision could conceivably be made. In this 
instance, there would be readily available in the registrar’s office 
considerable information about the population, such as the age dis- 
tribution, the distribution of grade-point averages, and the distribu- 
tions by departments or colleges, by sex, by religious affiliations, etc. 
Any or all of the available bases could, of course, be used. That is, we 
might first divide the population according to departments or colleges, 
then subdivide each departmental group according to sex, and then 
further subdivide these groups according to chronological age, etc. 
Our object, however, would be so to subdivide the population that each 
subpopulation would be as homogeneous as possible in knowledge of
contemporary affairs, or so that the subpopulations would show the 
largest possible differences in this trait. Furthermore, for reasons of 
convenience and economy, we would want to hold the number of 
subpopulations to the minimum effective number. Accordingly, we 
would first try to decide which of the possible bases of classification are 
most likely to be highly related to knowledge of contemporary affairs. 
We might observe, for example, that there are likely to be large 
systematic differences in this respect between students in certain 
departments of the university. Rather than use as many subdivisions 
as there are departments, however, we might lump together those 
departments most likely to be alike in this respect, and isolate those 
most likely to differ from the others. Suppose, then, that by following 
departmental boundaries we have thus divided the population into, 
say, four major groups. We might next observe that knowledge of 
contemporary affairs is likely to be fairly highly related to general 
scholastic ability as measured by grade-point averages. Accordingly, 
within each of the departmental groups we might subdivide the 
students into, say, five grade-point intervals, thus making twenty sub- 
populations in all. While some of the other possible bases for classi- 
fication might be related to knowledge of contemporary affairs, we 
might decide that further subdivision would not be worth the trouble 
involved. It should be noted, incidentally, that whenever multiple 
classification is employed, the various classifications used should 
preferably show low intercorrelations, but each should be related as 
highly as possible to the trait investigated. In this case, for instance, 
while grade-point averages are not likely to differ greatly from depart- 
ment to department, both the departmental classification and the 
grade-point classification are likely to be significantly related to a 
knowledge of contemporary affairs. 
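
The double classification just described, four departmental groups by five grade-point intervals, might be set up along the following lines. The record layout, the grade-point cut points, and the function name `gpa_interval` are assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical registrar records: (student id, department group 1-4,
# grade-point average on a 0.0-4.0 scale).
records = [(1, 1, 3.7), (2, 1, 2.1), (3, 2, 3.0), (4, 3, 1.4),
           (5, 4, 2.6), (6, 2, 3.9), (7, 3, 0.8), (8, 4, 3.2)]

GPA_CUTS = [1.0, 2.0, 3.0, 3.5]  # four cut points give five intervals

def gpa_interval(gpa):
    """Index (0-4) of the grade-point interval containing gpa."""
    return sum(gpa >= cut for cut in GPA_CUTS)

# Each subpopulation is identified by (department group, GPA interval).
strata = defaultdict(list)
for student_id, dept_group, gpa in records:
    strata[(dept_group, gpa_interval(gpa))].append(student_id)

print(f"{len(strata)} of the 4 x 5 = 20 possible subpopulations are occupied")
```

In an actual study each list would then serve as the catalogue from which the subsample for that subpopulation is drawn at random.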

Suppose, then, that we have decided to use twenty subpopulations 
as just described. The next step would be to list the students con- 
stituting each subpopulation, select from each list the desired number 
of students at random, and administer the test to these students. 
Ordinarily, we would make the number drawn from each subpopulation
proportional to the size of the subpopulation, but this is not always the
case, as we shall see later. The next step would be to estimate the 
mean of the total population. This estimate is the weighted mean of 
the subsample means, the weights being either the numbers in the 
corresponding subpopulations, or numbers which are proportional to 
the subpopulation numbers. To compute the weighted mean, then,
we would multiply each subsample mean by its appropriate weight, and
divide the sum of these products by the sum of the weights. This
procedure may be expressed by the formula

$$M = \frac{\sum n_i' M_i}{\sum n_i'}$$

in which $M$ is the weighted mean of the total sample, or the estimated
population mean, $M_i$ the mean of the $i$th subsample, and in which $n_i'$
represents either the size of the corresponding subpopulation or a
number in the same proportion. This formula is needed, of course,
only when the sizes of the subsamples are not already proportional to
the weights; when they are already in this proportion, the estimate
needed is simply the general mean of the total sample.
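
A minimal sketch of the weighted mean follows. The subpopulation sizes and subsample means are invented, and `weighted_mean` is a name introduced here for illustration.

```python
def weighted_mean(subsample_means, weights):
    """Sum of (weight * subsample mean) divided by the sum of the weights,
    the weights being proportional to the subpopulation sizes."""
    total = sum(w * m for w, m in zip(weights, subsample_means))
    return total / sum(weights)

# Hypothetical subpopulation sizes and subsample means.
sizes = [1000, 3000, 6000]
means = [70.0, 60.0, 50.0]

print(weighted_mean(means, sizes))      # weights are the actual sizes
print(weighted_mean(means, [1, 3, 6]))  # weights merely proportional to them
```

Both calls give 55.0, since only the proportions among the weights matter.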

To estimate the standard error of this weighted mean, it is first
necessary to estimate the variance of the mean of each subsample.
This is done by the usual formula,

$$\sigma_{M_i}^2 = \frac{\sum d_i^2}{n_i (n_i - 1)}$$

in which $\sigma_{M_i}^2$ is the estimated variance of the mean in the $i$th
subsample, $d_i$ is a deviation from the mean of the subsample, and $n_i$ is the
number of cases in the subsample. The estimated standard error ($\sigma_M$)
of the weighted mean may then be computed from the formula¹

$$\sigma_M = \frac{\sqrt{\sum n_i'^2 \, \sigma_{M_i}^2}}{\sum n_i'}$$
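
The computation may be sketched as follows. The sketch takes the standard error of the weighted mean to be the square root of the sum of the squared weights times the variances of the subsample means, divided by the sum of the weights. This is the standard result for independent subsamples and is assumed here to be the formula intended; the data and function names are invented.

```python
import math
import statistics

def variance_of_mean(scores):
    """Estimated variance of a subsample mean: sum(d^2) / (n * (n - 1))."""
    n = len(scores)
    m = statistics.mean(scores)
    return sum((x - m) ** 2 for x in scores) / (n * (n - 1))

def weighted_mean_se(subsamples, weights):
    """Standard error of the weighted mean, assuming independent
    subsamples: sqrt(sum(w_i^2 * var_i)) / sum(w_i)."""
    num = sum(w * w * variance_of_mean(s) for s, w in zip(subsamples, weights))
    return math.sqrt(num) / sum(weights)

# Hypothetical subsamples and subpopulation sizes.
subsamples = [[70, 72, 68, 74], [60, 58, 62, 61, 59], [50, 53, 47, 49, 51, 50]]
weights = [1000, 3000, 6000]

se = weighted_mean_se(subsamples, weights)
df = sum(len(s) - 1 for s in subsamples)  # degrees of freedom, sum(n_i - 1)
print(f"SE = {se:.3f} on {df} d.f.")
```

As with the weighted mean itself, multiplying every weight by the same constant leaves the result unchanged.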
It is apparent from these formulas, as was suggested earlier, that
the advantage of stratified over simple random sampling is directly
dependent upon the homogeneity of the subsamples, or, indirectly,
upon the magnitude of the differences between the subsample means.
It should be noted that these formulas yield unbiased estimates regardless
of the relation between the sizes of the subsamples and of the corresponding
subpopulations. This relationship, however, is extremely
important, since it in part determines the reliability of the weighted
mean. Neyman² has shown that if the subpopulations are equally





¹ The number of degrees of freedom for this estimated standard error is
$\sum (n_i - 1)$.

² Neyman, J.: “On the Two Different Aspects of the Representative Method.”
Journal of the Royal Statistical Society, Vol. XCVII, 1934, pp. 558-625.







572 The Journal of Educational Psychology 


variable in the trait measured, the weighted mean will be most reliable 
when the numbers in the subsamples are proportional to the numbers 
in the corresponding subpopulations. In this case, the estimated 
standard error of the mean depends only upon the average variance 
of the subsamples, and can be most readily computed by the method 
of analysis of variance. If the subpopulations differ in variability, 
then the weighted mean will be most reliable when the numbers in the 
subsamples are proportional to the products of the numbers in, and 
the standard deviations of, the corresponding subpopulations. Again 
expressed as a formula, the optimum number to be selected from subpopulation
$i$ is given by

$$n_i'' = n \cdot \frac{n_i' \sigma_i}{\sum n_i' \sigma_i}$$

in which $n_i''$ is the optimum number to be selected from the $i$th
subpopulation, $n$ is the total number of cases to be drawn in all subsamples,
$n_i'$ is, as before, the size of the $i$th subpopulation (or a number in
the same proportion), and $\sigma_i$ is the standard deviation of the $i$th subpopulation.
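
The allocation rule can be sketched in a few lines. The subpopulation sizes and standard deviations below are invented, and `optimum_allocation` is a name introduced for the sketch; the results are left unrounded.

```python
def optimum_allocation(n_total, sizes, sds):
    """Optimum subsample sizes: n_total times each (size * SD) product,
    divided by the sum of those products."""
    products = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(products)
    return [n_total * p / total for p in products]

sizes = [1000, 3000, 6000]  # hypothetical subpopulation sizes
sds = [12.0, 8.0, 4.0]      # hypothetical subpopulation standard deviations

unequal = optimum_allocation(400, sizes, sds)
# With equal SDs the rule reduces to proportional allocation.
equal = optimum_allocation(400, sizes, [5.0, 5.0, 5.0])
print(unequal)
print(equal)
```

Here the small but highly variable first subpopulation receives 80 cases under the optimum rule, against 40 under proportional allocation.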

It is not always possible to take advantage of this latter relation- 
ship, since usually no advance estimates are available of the standard 
deviations of the subpopulations. To make use of this relationship, we 
would ordinarily have to do our sampling in two steps. In this 
example, for instance, while we might plan eventually to use a sample 
of four hundred cases, we might begin by taking a sample of two 
hundred, making the numbers in this sample proportional only to the 
numbers in the subpopulations. We could then, on the basis of the 
data secured from this preliminary sample, estimate the variability of 
each subpopulation. The optimum size of the subsamples in a total 
sample of four hundred could then be computed by substituting these 
estimated standard deviations in the formula last given. We could 
then select the additional two hundred cases so as to bring the numbers 
in each subsample up to these optimum numbers, that is, so as to make 
the final numbers in the subsamples proportional to the products of the 
numbers and the estimated standard deviations in the subpopulations. 
It is apparent that this procedure would seldom be practical or con- 
venient. However, it is important to take variability into consider- 
ation only when the differences in variability are of appreciable magni- 
tude, and in many educational research situations, such as the example 
I have used, this condition might be quite unlikely. Furthermore, we 
may sometimes be interested in a certain subpopulation for its own 
sake, and may wish to have a reliable estimate for that subpopulation
regardless of how small it may be in relation to the others. In that 
case, we might draw a larger subsample from that subpopulation than 
would be demanded by the preceding considerations. 

The foregoing example illustrates in very bare outline most of the 
important theoretical considerations in stratified sampling. The 
important assumptions are that all subpopulations are represented, and 
that the subsamples were selected at random. The reliability of the 
estimate then depends upon how well we have succeeded in dividing 
the population into subpopulations which are more homogeneous than 
the whole. 

As I have suggested earlier, this example is artificially simple in 
comparison to most actual instances of stratified sampling. In the 
first place, I used an example in which the entire population had 
already been catalogued, and about which considerable useful informa- 
tion was already available. In most actual applications of stratified 
sampling, very little if anything would be known about the population 
in advance, and in practically no instances would the entire population 
be catalogued. In many situations, therefore, we would have to run a 
preliminary large-scale sampling study in order to obtain some estimate 
of how the members of the population are distributed with reference to 
the “controls” that we intend to use. (By a “control” I refer to a 
basis of classification. For instance, in the example already used, a 
departmental control and a control of grade-point averages were 
utilized.) Such a preliminary study would be worth while only if the 
information needed for the controls could be very much more readily or 
economically secured than the information we intend to obtain in our 
final sampling. Suppose, for instance, that the study requires information
that can be obtained only through relatively long personal interviews,
or very elaborate questionnaires, but that valuable control data
can be readily secured by telephone, by indirect inquiry, or by very 
simple questionnaires which could be distributed by mail. In that 
case, it might be worth while to run a preliminary study on a large 
scale to secure the information necessary for the later control of the 
final, more intensive sampling. In general, the desirability of a 
preliminary sampling of this type would depend upon the degree of 
relationship between the intended “control” variables and the variable 
to be investigated, and upon the economy with which the control 
information could be secured. What procedure we would employ, 
then, would eventually depend upon the total cost of the double 
sampling procedure as compared to the cost of a less closely controlled
sample large enough to yield equally dependable results. 

In the second place, the fact that in most actual applications the 
entire population would not be or could not be catalogued would make 
it impossible to insure strictly random sampling from the various sub- 
populations. Within each subpopulation, then, we might be forced to 
resort to what is an essentially fortuitous or accidental sampling, and 
hence the formulas could not be expected to yield completely unbiased 
results. However, if all possible precautions against bias are exercised 
in the selection of the subsamples, the formulas should yield at least 
very useful approximations to the true values. 

It should be noted finally that stratified samples as they are actually 
obtained are seldom very much more reliable than strictly random 
samples of the same size would be, since it is seldom possible to sub- 
divide the population into groups that are very much more homo- 
geneous than the population as a whole. Stratified sampling is, 
nevertheless, important not so much because it is more efficient or more 
reliable than simple random sampling, but because simple random 
sampling is in most instances impracticable, and because the uncon- 
trolled, fortuitous sampling which might otherwise be used is so much 
more likely to be characterized by very serious bias. Perhaps no
better illustration of that can be found than in the polls of public 
opinion which preceded the 1936 presidential election. The Literary 
Digest poll, while based on tremendous numbers, was very seriously 
biased because it gave very inadequate representation to certain classes 
in the population, such as persons in the low income brackets, the 
unemployed and those on relief, and the younger element in the 
population. By exercising very simple controls over income, age, 
geographic distribution, etc., Gallup was able to obtain a much better 
estimate with a sample of only thirty thousand cases. A strictly 
random sample of thirty thousand might have been just as good as 
Gallup’s, since within his controls Gallup’s selection was by no means 
random, but of course a strictly random sample was something the 
Literary Digest could not manage. In conclusion, then, I will venture 
the opinion that stratified sampling will be used increasingly in educa- 
tional research, not primarily because it minimizes the chance errors in 
random selection, but because through it we may avoid some of the 
large systematic errors which have so often resulted in haphazard or 
fortuitous sampling, and which are doubly serious because their 
presence or magnitude can never be gauged by mathematical formulas. 

SPEED AND QUALITY OF ASSOCIATION AS A
MEASURE OF VOCABULARY KNOWLEDGE 


MILES A. TINKER, FLORENCE HACKNER AND MARION W. WESLEY 


University of Minnesota 


Conrad and Harris¹ found a correlation of .83 between Alpha scores
and hard words in a free-association test. Credit was given for each
associate that indicated comprehension of the stimulus word. In a
check experiment by Tinker,² the comparable coefficients range from
.47 to .52. He also found that high intelligence scores tended to
accompany speedy association as shown by correlations ranging from
-.20 to -.38. These results suggest that there is some relationship
between both speed and quality of association and intelligence.
Further investigation with a somewhat different approach to the 
problem may be profitable. 

Obviously a stimulus word must be clearly comprehended to give 
rise to an association that is a synonym, an antonym, a definition, or a 
word that is connected through usage in a logical manner to the 
stimulus. Also, it is clear that comprehension must be present to 
indicate consistently the correct definition of items in well-constructed 
vocabulary tests. It would seem, therefore, that vocabulary knowl- 
edge might be adequately measured by free association as well as by 
the traditional form of vocabulary test. Furthermore, the relationship 
discovered between quality of free association and vocabulary test 
scores should indicate to what extent the free association is measuring 
verbal intelligence since the correlation between vocabulary tests and 
verbal intelligence tests is usually high. 

Does speed of association bear a significant relation to vocabulary 
knowledge? If a word is clearly comprehended it seems reasonable to 
assume that associations connected with the word will arise more 
readily than when comprehension is less clear or absent. 

The purpose of this experiment is to measure the relationship 
between: (1) Speed of association and vocabulary knowledge, and (2) 
quality of association and vocabulary knowledge. By quality is meant 





1 Conrad, H. S. and Harris, D.: "The Free-Association Method and the
Measurement of Adult Intelligence." University of California Publications in
Psychology, Vol. v, No. 1, 1931, pp. 1-45.

2 Tinker, M. A.: Speed and Quality of Association as a Measure of Intelligence
on the College Level. J. Gen. Psychol. (In press.)
whether or not the association given indicates that the stimulus word
has been comprehended. To investigate these problems two experi- 
ments were conducted. 


PART I 


In the first experiment speed of association for easy stimulus words, 
speed and quality of association for hard stimulus words, and speed and 
quality of responses to a vocabulary test were measured. A series of 
fifty words from the most frequent five hundred in Thorndike’s Word 
List was employed for the easy words. Another fifty words from 
Thorndike’s nineteen thousand and twenty thousand frequencies were 
used for the hard words. In both series the words were chosen at 
random. Each word was printed in twelve-point type on the center of 
a three- by five-inch card. 

The cards, each series separately, were presented to a subject face
down. Instructions were to turn the card over at the "go" signal, read
the stimulus word and then to speak the first word (association) 
brought to mind by the stimulus word. Time of response was meas- 
ured with a stopwatch and both time and the association recorded. 
Subjects were urged to give their response as quickly as possible after 
seeing the stimulus word. 


TABLE I. MEANS AND STANDARD DEVIATIONS OF THE MEASURES EMPLOYED IN PART I

Measure                                   Mean score       SD
Easy word association: Time               108.6 seconds    29.5
Hard word association: Time               135.4 seconds    36.4
Hard word association: Comprehension       28.1 items       7.8
Vocabulary A: Score in 5 minutes           44.1 items      12.1
Vocabulary A: Score unlimited time         72.7 items      14.2
Vocabulary A: Total time                   12.2 minutes     3.8
Vocabulary B: Score in 5 minutes           43.0 items      14.8
Vocabulary B: Score unlimited time         70.1 items      15.5
Vocabulary B: Total time                   12.6 minutes     3.8











Two forms of the vocabulary test which makes up Part I of the 
Minnesota Reading Examination for College Students¹ were employed
to measure vocabulary knowledge. The test was given according to 





1 Eurich, A. C. and Haggerty, M. E.: The Minnesota Reading Examination for
College Students. Minneapolis: University of Minnesota Press, 1935.
the standard directions except for the following modifications: (1) The
responses were given orally and recorded by the experimenter. (2) A 
response was required for every item. (3) Amount done in five 
minutes was recorded. (4) Time taken to finish the one hundred items 
was recorded. 

One hundred college students were subjects in the experiment. All 
testing was done individually. 

The basic data of the experiment are listed in Table I. Association 
time for the hard words is slightly longer than for easy words. A key 
was made out for scoring comprehension of associations to the hard 
words. Associations that were synonyms, antonyms, definitions, or 
were logically connected in usage with the stimulus were considered 
comprehending. Doubtful cases were decided by the three writers in 
conference. Note that all three scores on Form A of the vocabulary 
test are approximately equivalent to those on Form B. 


TABLE II. RELIABILITY COEFFICIENTS FOR MEASURES USED IN PART I

Measure                                  Method of computing    r       Whole test: r
Easy word association: Time              Split-half             .722    .839
Hard word association: Time              Split-half             .806    .893
Hard word association: Comprehension     Split-half             .690    .817
Vocabulary: Score in 5 minutes           Form A vs. Form B      .871    .871
Vocabulary: Score unlimited time         Form A vs. Form B      .942    .942
Vocabulary: Total time                   Form A vs. Form B      .863    .863














Reliabilities of the measures are given in Table II. Where the
split-half method of computing reliability was employed, the Spearman-Brown
(S-B) formula was applied to predict reliability of the entire test. All
reliabilities are highly satisfactory, ranging from .817 to .942.
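
The step from a split-half coefficient to a whole-test reliability can be sketched as follows. This is only an illustration of the Spearman-Brown formula for a test of doubled length; the function name `step_up_split_half` is hypothetical, and the three input values are the split-half coefficients reported in Table II.

```python
def step_up_split_half(r_half):
    """Spearman-Brown estimate of whole-test reliability from the
    correlation between two half-tests (test length doubled)."""
    return 2 * r_half / (1 + r_half)


# Split-half coefficients as reported in Table II
split_half = {
    "Easy word association: Time": 0.722,
    "Hard word association: Time": 0.806,
    "Hard word association: Comprehension": 0.690,
}

whole_test = {name: step_up_split_half(r) for name, r in split_half.items()}
for name, r in whole_test.items():
    print(f"{name}: {r:.3f}")
```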

Intercorrelations are listed in Table III. The coefficient of .68 
between speed of association for easy and hard words indicates a 
substantial but not high relationship. Speed of association, for both 
easy and hard words, shows a slight but significant correlation with 
comprehension of hard words and with vocabulary knowledge. The 
coefficients range from —.25 to —.40. Thus those with high standing 
in comprehension and in vocabulary scores tend to have faster associa- 
tions. The total time for the vocabulary test shows: (1) An insignifi- 
cant correlation with comprehension of hard words, (2) a slight but 
significant correlation with speed of association and with vocabulary 



578 The Journal of Educational Psychology 


score in unlimited time, and (3) a relatively high correlation (—.78) 
with vocabulary score in five minutes. Apparently rate of work on the 
vocabulary test is diagnostic of power achieved under a restricted time 
limit. Also the vocabulary score in five minutes is fairly indicative of 
score in unlimited time (r = .74). 


TABLE III. INTERCORRELATIONS FOR MEASURES IN PART I
(Form A of Vocabulary Test Used)

Measure                                      II      III     IV      V       Vocabulary: Total time
I. Easy word association: Time               +.68    -.25    -.34    -.39    +.27
II. Hard word association: Time              ....    -.30    -.38    -.40    +.26
III. Hard word association: Comprehension    ....    ....    +.48    +.61    -.17
IV. Vocabulary: Score in 5 minutes           ....    ....    ....    +.74    -.78
V. Vocabulary: Score unlimited time          ....    ....    ....    ....    -.37





The correlation in which we are most interested is that between 
comprehension of hard words in the association test and the vocabulary 
scores. Here the coefficients are .48 between comprehension and 
vocabulary score in five minutes and .61 between comprehension and 
vocabulary score in unlimited time. While the relationship indicated 
by these coefficients is substantial, it is not high enough to be of practi- 
cal value. In other words, the validity of the association test is not 
satisfactory as a measure of vocabulary knowledge. Scores on this 
vocabulary test correlate .83 with the Minnesota C.A.T.¹ If the
vocabulary test is considered to be a measure of intelligence, then, the 
correlations of .48 and .61 with hard word comprehension agree fairly 
well with those (.47 to .52) cited in an earlier experiment (Tinker, op. 
cit.). 

The data indicate quite clearly that hard-word comprehension as 
measured by associative responses is not a valid measure of either 
vocabulary knowledge or of intelligence even though the reliability of 
response is high. A truer picture of the validity of the free-association 
test as a measure of vocabulary knowledge might be obtained if the 
material in the free-association test were strictly comparable to that in 
the vocabulary test. This has been done in Part II. 





1 Eurich, A. C.: The Reading Abilities of College Students. Minneapolis:
University of Minnesota Press, 1931.
PART II


It is firmly established that there is no general reading ability. On
the contrary, there seem to be a number of reading skills that are
rather specifically limited to the kind of material being read. In a like
manner, it may be that vocabulary knowledge as measured by the 
free-association method is limited pretty much to the words used or to 
strictly comparable word series. In this part of the experiment the 
plan is to investigate speed and quality of associations in relation to 
vocabulary knowledge when the stimulus words are identical in both 
the association and vocabulary series. 

A list of one hundred twenty hard words was selected at random
from Thorndike’s Word List. They all had frequencies of nineteen 
thousand or twenty thousand. From these was constructed a one- 
hundred-twenty-item vocabulary test. For each item there were four 
numbered answers, one of which was correct. A fore-exercise of five 
items and a set of directions comparable to those for the vocabulary 
test used in Part I of this experiment were provided. The test was 
divided into two equal parts, A and B. Part B was given separately 
from and after Part A. One person was tested at a time. The 
subjects were instructed to work accurately but swiftly. In each part 
a mark was made at the end of four minutes (after the last item done). 
The subject then finished the test and total time for completing the 
sixty items was recorded. Scores for Parts A and B were combined 
after reliabilities had been computed. 

For the association test each of the one hundred twenty words was 
typed on a three- by five-inch card. The words were put in the same 
order as in the vocabulary test to facilitate computation of reliability 
by the split-half method. Five practice cards were provided as a 
fore-exercise. 

The association test was given first with directions identical with 
those employed in Part I. The subject responded orally and the 
response word and time taken to respond were recorded for each item. 
The vocabulary test followed the association test. The number of the
response considered correct was written by the subject in a parenthesis 
opposite the item. Both the association and the vocabulary tests were 
administered at the same sitting to prevent looking up any words in 
between. The scoring of associative responses for comprehension was 
like that described in Part I. 
High-school seniors, college freshmen and college sophomores were
employed as subjects. There were one hundred in all. 

The basic scores are presented in Table IV. The time taken for 
responses in each test is considered to be a measure of speed or rate of 
work. The vocabulary score when the subjects were able to try all
items is about fifty per cent greater than for the eight-minute time 
limit. When the free-association technique was used, thirty-seven and 
nine-tenths out of one hundred twenty responses on the average were 
scored as comprehending. This is to be contrasted with the mean 
score of sixty-six and six-tenths in the vocabulary test. Note that the 
variability for time of free-association is comparatively large. 


TABLE IV. MEANS AND STANDARD DEVIATIONS OF THE MEASURES EMPLOYED IN PART II

Measure                                Mean score       SD
Vocabulary: Score in 8 minutes          44.1 items      18.4
Vocabulary: Score unlimited time        66.6 items      18.8
Vocabulary: Total time                  15.2 minutes     4.3
Association: Comprehension              37.9 items      18.9
Association: Total time                438.0 sec.      201.1





Reliability coefficients are listed in Table V. The reliability of 
scores for the whole test is highly satisfactory in each instance. The 
coefficients (S-B) range from .925 to .974.


TABLE V. RELIABILITY COEFFICIENTS FOR MEASURES USED IN PART II
(Method of Computing = Split-half)

Measure                                r       Whole test: r
Vocabulary: Score in 8 minutes         .865    .928
Vocabulary: Score unlimited time       .861    .925
Vocabulary: Total time                 .881    .937
Association: Comprehension             .889    .941
Association: Total time                .950    .974











Intercorrelations are given in Table VI. As in Part I, the vocabu- 
lary scores in restricted time are highly diagnostic of scores in unlimited 
time: r = .866. The rate of work on the vocabulary test is intimately 
related to score in restricted time: r = —.832. Also time taken in the 
vocabulary test shows a slight but significant correlation with both 
comprehension and speed scores in free association. Although speed of
association reveals no significant correlation with vocabulary score in
restricted time it does show a substantial relation (r = -.550) to score
in unlimited time. The correlation between speed and comprehension
in associative responses is low but significant: r = -.295. Thus we
see that the person with fast associations tends to score higher in
vocabulary knowledge whether measured by a recognition technique
(vocabulary test) or by free association.


TABLE VI. INTERCORRELATIONS FOR MEASURES IN PART II

Measure                                   II       III      IV       Association: Total time
I. Vocabulary: Score in 8 minutes         +.866    -.832    +.763    -.128
II. Vocabulary: Score unlimited time      ....     -.334    +.830    -.550
III. Vocabulary: Total time               ....     ....     -.296    +.400
IV. Association: Comprehension            ....     ....     ....     -.295

















Does the degree of comprehension in associative responses reveal 
the vocabulary knowledge of the subject? The correlations of .763 and 
.830 between comprehension scores in free association and vocabulary 
scores give an affirmative answer to the question. The coefficient of 
.830 undoubtedly gives the truer picture of the relationship, since 
responses to the whole list of one hundred twenty test items are represented
in both the scores correlated. Thus, if the stimulus words in
the association test are the same as the items in the vocabulary test, the 
free-association technique is a valid measure of vocabulary knowledge. 
This is true even though the mean free-association score is only 37.9 in 
comparison with the vocabulary score of 66.6. 

Application of the free-association technique to measure either 
vocabulary knowledge or intelligence seems to be rather limited. The 
results in Part I reveal a substantial but not high correlation between 
free association and a general vocabulary test, which is a good measure of
intelligence. It is only when the stimulus words in the free-association 
test are identical with the items in the vocabulary test that this inter- 
correlation is high. Apparently whatever is measured when associa- 
tive responses to hard words are scored for comprehension, the results 
are rather specifically tied to the particular words employed as stimuli. 
In other words, the responses measure individual differences in vocabulary
knowledge for words in the stimuli series but not for vocabulary
knowledge in general. Contrary to the suggestion of Conrad and
Harris,¹ the free-association test does not seem adequate as a disguised
measure of intelligence. In situations where a disguised vocabulary 
test is desirable, however, the free-association test may be used with the 
knowledge that it has high validity. 

It might be possible, with a more adequate selection of stimulus 
words, to construct a free-association test that would be a valid meas- 
ure of general vocabulary and thus, perhaps, of verbal intelligence. 


SUMMARY AND CONCLUSION 


(1) The relation of speed and quality of association to vocabulary 
knowledge was studied in two experiments: (a) In Part I speed of 
association to fifty easy words and speed and quality of associative 
responses to fifty hard words were compared with vocabulary knowl- 
edge measured on a standardized test, the items of which did not 
appear in the free-association test. (b) For similar comparisons in 
Part II, one hundred twenty hard words made up the items in both the 
free-association test and the vocabulary test. 

(2) In both parts of the experiment there was a slight but signifi- 
cant correlation between speed of association and vocabulary knowl- 
edge. Those with higher vocabulary scores tended to have faster 
associations. 

(3) When the items in the free-association test were different from 
those in the vocabulary test (Part I), a substantial but not high correla- 
tion was found between quality of associative responses and vocabulary 
knowledge. 

(4) When the items in the free-association test and in the vocabu- 
lary test were identical (Part II) there was a high correlation (+.83) 
between quality of associative responses and vocabulary score. 

(5) Our results justify the conclusion that the free-association 
technique may be employed to measure specific vocabulary knowledge. 

RELIABILITY OF MULTIPLE-CHOICE MEASURING
INSTRUMENTS AS A FUNCTION OF THE 
SPEARMAN-BROWN PROPHECY FORMULA, I 


H. H. REMMERS, RUTH KARSLAKE AND N. L. GAGE 


Purdue University 
I. THE HYPOTHESIS


The Spearman-Brown prophecy formula was developed to provide 
an estimate of the effect of change in a test’s length on its reliability. 
It is now also well-known that, in many situations, human judgments 
behave as do test items, in the sense of the Spearman-Brown formula.*® 

In his study of attitude measurement Likert? found that when an 
arbitrary numerical scale was attached to each of the items of a 
Thurstone attitude scale so that the subject has to choose one of five 
degrees of agreement-disagreement instead of two choices (endorse- 
ment or non-endorsement), as in the Thurstone attitude scales, the 
reliability of measurement was increased appreciably. In seeking a 
rational basis for this result, it occurred to the senior author that this 
increase in reliability with an increase in the number of choices might 
be a function of the Spearman-Brown prophecy formula. By a rather 
obvious step this reasoning was then extended to all multiple choice 
measuring instruments, as follows: 

Consider two multiple choice measuring instruments, A and B, 
which differ only in the number of response alternatives presented in 
each item. (In all other respects, such as number and difficulty of 
items, plausibility of response-alternatives, etc., the two instruments 
are thoroughly equivalent.) If A’s number of response alternatives is 
different from B’s, then A’s reliability will be predictable from B’s by 
the Spearman-Brown formula, 


    r_A = \frac{n r_B}{1 + (n - 1) r_B}

where r_A equals reliability of A,

where r_B equals reliability of B, and

    n = \frac{n_A}{n_B} = \frac{\text{number of response alternatives in each item of A}}{\text{number of response alternatives in each item of B}}


Thus, the reliability of a three-response test can be predicted from that
of a two-response test by substituting 1.5 (equals 3/2) for n and the
two-response reliability for r_B in the formula. Similarly a five-response test
is considered to be 1.667 times as long as an otherwise equivalent
three-response test.
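
The hypothesis may be illustrated by a short Python sketch, offered only as an example of the arithmetic. The function name `predict_reliability` is hypothetical; the input reliabilities of .737 and .598 are values reported for the Ruch-Stoddard tests later in this paper.

```python
def predict_reliability(r_b, n_a, n_b):
    """Spearman-Brown prediction of the reliability of test A (n_a response
    alternatives per item) from that of test B (n_b alternatives per item),
    treating the ratio n_a / n_b as the change in test length."""
    n = n_a / n_b
    return n * r_b / (1 + (n - 1) * r_b)


# Two-response test with reliability .737, predicted to three and five responses
r_two_to_three = predict_reliability(0.737, 3, 2)
r_two_to_five = predict_reliability(0.737, 5, 2)

# Three-response test with reliability .598, predicted to five responses
r_three_to_five = predict_reliability(0.598, 5, 3)

print(round(r_two_to_three, 3), round(r_two_to_five, 3), round(r_three_to_five, 3))
```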


II. METHOD 


The present paper reports a study of the validity of this hypothesis 
in terms of published coefficients of reliability obtained for a variety of 
achievement tests under conditions seemingly appropriate to the 
problem of the present study. The published reliabilities selected as 
suitable* for this purpose were those given by Ruch and Stoddard,’ 
Ruch and Charles, and Ruch, DeGraff and others.’ In each case, 
several tests had been designed with the same items, differing only in 
the number of response alternatives. That the purposes of these 
investigations were not those of the present study was immaterial, 
insofar as the conditions met those required by the latter. Reliabilities 
obtained for the different tests were given by these investigators. The 
present study used these reliabilities as a basis for prediction with the 
Spearman-Brown formula of the expected reliabilities if the tests were 
to be increased by given amounts. Since the studies gave obtained 
reliabilities for two-, three-, four-, and (sometimes) seven-response 
items, it was possible to compare the r’s predicted by the formula with 
those actually obtained. 

In the following tables are given the results of the comparisons. In
each case, the first column, n_B, gives the number of response-alternatives
on which the original reliability was based. The second column
gives the increased number of responses, n_A, to which the reliability was
predicted. The third column gives the reliability, r_B, of the original
test with n_B response-alternatives. The fourth column gives the
predicted reliability, r_A, of the test with n_A response-alternatives. In
the fifth column is given the obtained reliability, r_obt, of the test with n_A
response-alternatives. In the sixth column is the critical ratio,† CR,





*Such studies as that by Sims and Knox® were unsuitable because the tests 
with different numbers of response alternatives were all given to the same group 
of subjects. The differences in reliability coefficients thereby obtained could not 
be evaluated by available methods, which require that the correlation coefficients 
whose difference is being tested for significance be obtained from independent, 
random samples of subjects. [Cf. Lindquist, E. F.,* p. 217.]

† Using Fisher's¹ z-transformation,

    CR = \frac{z_{r_A} - z_{r_{obt}}}{\sigma_{(z_{r_A} - z_{r_{obt}})}}

However, it has been assumed in computing \sigma_{(z_{r_A} - z_{r_{obt}})} that \sigma_{z_{r_A}}, which is the
of the difference between obtained (r_obt) and predicted (r_A) reliabilities
of tests with the increased (n_A) number of alternatives.

A typical row of Table I is interpreted as follows: From a test with 
two response-alternatives and with reliability .737, the reliability for 
the same test with three response-alternatives is predicted to be .808. 
The obtained reliability for the test with three response-alternatives is 
.598. The critical ratio of the difference between an r of .808 and an
r of .598, when N equals 135, is 3.49. Hence the predicted r is signifi- 
cantly different from the obtained r. 
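
The critical ratio in this example can be approximately reproduced with the following Python sketch. It assumes that each z has the usual standard error for Fisher's transformation, 1/sqrt(N - 3), and that the two coefficients are treated as coming from independent samples of the same size; both points are assumptions made for illustration, and the function name `critical_ratio` is hypothetical.

```python
import math


def critical_ratio(r_pred, r_obt, n_subjects):
    """Difference between two correlations on Fisher's z scale, divided by
    the standard error of that difference (assumed: independent samples,
    each z with standard error 1 / sqrt(N - 3))."""
    z_diff = math.atanh(r_pred) - math.atanh(r_obt)
    se_diff = math.sqrt(2.0 / (n_subjects - 3))
    return z_diff / se_diff


# Predicted r of .808 against obtained r of .598, with N = 135
cr = critical_ratio(0.808, 0.598, 135)
print(f"CR = {cr:.2f}")
```

Under these assumptions the result comes out close to the tabled value of 3.49; the small difference may reflect rounding in the original computation.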


III. RESULTS 


The test used in Table I was a general information test based on 
history and social science material. It was designed especially for the 
Ruch-Stoddard study and contained two forms, A and B, of fifty items 
each. Each item was stated in five different forms as follows:

(1) Recall (single blank completion type) 

(2) Five-response 

(3) Three-response 

(4) Two-response 

(5) True-false 

The subjects were five hundred sixty-two high-school seniors, the 
entire senior classes of twenty-four Iowa high schools. For the pur- 
poses of the investigation the names were alphabetized for each school. 
Four groups—A, B, C, and D—with one hundred thirty-five pupils in 
each, supposedly equal in ability, were formed in the following manner: 

Group A consisted of the first quarter of the alphabetical list from 
each school. 


Group B consisted of the second quarter of the alphabetical list 
from each school. 


Group C consisted of the third quarter of the alphabetical list from 
each school. 


Group D consisted of the fourth quarter of the alphabetical list from 
each school. 





standard error of the z transformation of an r predicted by the Spearman-Brown
prophecy formula, is equal to 1/\sqrt{N - 3}, i.e., no correction analogous to Shen's®
has been made for the error in r due to Spearman-Brown prediction. Such correction
would presumably increase \sigma_{(z_{r_A} - z_{r_{obt}})} and reduce the critical ratios. Hence
the present critical ratios probably err slightly in a direction opposed to our
hypothesis.
TABLE I. RUCH-STODDARD DATA
N = 135

n_B    n_A    r_B     r_A     r_obt    CR of r_A - r_obt
2      3      .737    .808    .598     3.49
2      3      .682    .763    .675*    1.46
3      5      .598    .713    .796     1.62
3      5      .675    .776    .775*    0.00
2      5      .737    .875    .796     2.19
2      5      .682    .843    .745*    1.62

* Indicate r's based on scores corrected for chance by the formula,

    Score = R - \frac{W}{n - 1}

All four groups were given the recall test on the first day. The next 
day Group A was given the five-response test; Group B, the three- 
response test; Group C, the two-response test; Group D, the true-false 
test. 
In each case both forms, A and B, of each were given to the group
taking it, and the scores on the two forms were used to compute the
reliabilities.

For the purposes of the present investigation, it was necessary to
consider only multiple choice items and so the data on the recall and
true-false tests are omitted. It will be noted that if a critical ratio of
2.5 or above is considered indicative of significance, the critical ratios in
Table I indicate statistically insignificant differences between predicted
and obtained r’s in five out of the six cases.
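
The entries of Table I can be approximately reproduced with a short Python sketch. Two points are assumptions inferred from the tabled values rather than stated in this section: the Spearman-Brown lengthening factor is taken as the ratio of the numbers of alternatives, N_A / N_B, and the critical ratio is taken as the difference between the Fisher z values of the predicted and obtained r's divided by $\sqrt{2/(N-3)}$, the standard error for two independent groups of size N. The function names are illustrative only.

```python
import math


def spearman_brown(r, k):
    """Reliability predicted when a test is 'lengthened' by a factor k."""
    return k * r / (1 + (k - 1) * r)


def critical_ratio(r_pred, r_obt, n):
    """Difference of Fisher z's over its standard error (two groups of size n)."""
    se_diff = math.sqrt(2.0 / (n - 3))  # assumed: independent groups, equal N
    return abs(math.atanh(r_pred) - math.atanh(r_obt)) / se_diff


# First row of Table I: N_B = 2, N_A = 3, r_B = .737, obtained r = .598, N = 135.
r_pred = spearman_brown(0.737, 3 / 2)
cr = critical_ratio(r_pred, 0.598, 135)
print(f"predicted r = {r_pred:.3f}, critical ratio = {cr:.2f}")
```

Under these assumptions the first row comes out close to the tabled .808 and 3.49; small discrepancies in other rows would be expected from rounding in the original hand computation.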


TABLE II. RUCH-CHARLES DATA

N = 188

N_B   N_A    r_B     r_A     r_obt    CR of r_A - r_obt
 2     3     .477    .578    .624           .67
 3     5     .624    .735    .680          1.06
 2     3     .527    .626    .619*          .10
 3     5     .619    .731    .647*         1.54
 2     5     .477    .695    .680           .29
 2     5     .527    .736    .647*         1.63

* Indicates r’s based on scores corrected for chance by the formula,
$\text{Score} = R - \frac{W}{n-1}$.

The test used in Table II was composed of one hundred questions
based on Woodworth’s psychology text. Five tests were prepared, 
each consisting of two forms, A and B, of fifty items each. This 
battery of tests consisted of one each of the following types: Recall, 
five-response, three-response, two-response, true-false. 

The subjects consisted of seven hundred forty-seven college stu- 
dents. On the first day of testing, the recall test was given to the 
entire group. On the second day, the group was divided into four 
sections, numbering one hundred eighty-two, one hundred eighty-eight, 
one hundred eighty-eight, and one hundred eighty-nine students, 
respectively. To the first group was given the five-response test; to 
the second, the three-response; to the third, the two-response; and to 
the last, the true-false. 

Both forms of each test were given, and reliabilities computed on
the basis of their separate scores. The scores were corrected for chance
by the formula, $S = R - \frac{W}{n-1}$, and reliabilities again computed.
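
A minimal sketch of the correction for chance follows. The formula is damaged in the source; the sketch assumes the conventional rights-minus-wrongs form, S = R - W/(n - 1), where n is the number of alternatives per item. The function name and the example counts are hypothetical.

```python
def corrected_score(right, wrong, n_alternatives):
    """Score corrected for chance: S = R - W / (n - 1) (assumed form)."""
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return right - wrong / (n_alternatives - 1)


# Hypothetical pupil on a fifty-item form: 30 right, 12 wrong, 8 omitted.
three_response = corrected_score(30, 12, 3)  # penalty is W / 2
two_response = corrected_score(30, 12, 2)    # penalty is the full W
```

The penalty per wrong answer shrinks as alternatives are added, since a blind guess is less likely to succeed when there are more choices.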


To provide for uniform conditions of prediction, the present study 
used an N of one hundred eighty-eight as most closely approximating 
the actual number in each group. 

It will be noted that the critical ratios of the differences between
obtained and predicted r’s support the hypothesis in all six cases.

The subject-matter of the tests used in the study reported in Table 
III was United States history. Originally parallel forms of one hun- 
dred questions each, equal in difficulty, were made of a recall test. 
From these were formed the same number of questions for a seven-
response multiple choice test. Additional alternatives of response
were dropped to give three other tests having items of five-response, 
three-response, and two-response alternatives, respectively. A true- 
false test was also made composed of these same items. One-half of 
the papers carried ‘‘guess’’ instructions; one-half, “‘do not guess” 
instructions. 

The subjects were twenty-four hundred fifty-three pupils from 
grades VII, VIII, XI, and XII of schools of Minnesota, Illinois, 
Oklahoma, Iowa, Missouri, Texas, Arizona, and California. 

At the first sitting, Form A of the recall test was given all pupils; 
Form B, at the second sitting. At the third sitting, some one of the ten 
recognition tests was given each pupil, with an attempt to achieve a 
random distribution of types. The N’s for each type ranged from two 
hundred twenty-seven to two hundred eighty-one. For the present
study, an N of two hundred forty was chosen as most nearly approximating
that for the group as a whole, in order to hold conditions of
prediction constant. Reliabilities were computed on the basis of the 
alternate forms, since the pupils had been paired throughout the three 
sittings. The r’s based on subjects instructed not to guess are under- 
lined. The critical ratios of Table III indicate significant differences 
between obtained and predicted r’s in eleven out of a possible twenty 
cases for this data. 

In summary it was found that the critical ratios supported the 
hypothesis in a total of twenty out of a possible thirty-two cases, for all 
three studies combined; in other words, the formula predicted with 
reasonable accuracy in about sixty per cent of the cases. 
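
The tally behind this summary can be checked with a few lines of Python. The critical ratios are transcribed from Tables I to III (a few damaged entries in Table III are read as .54, .44 and .76, which is an assumption), and the 2.5 threshold is the one stated for Table I.

```python
THRESHOLD = 2.5  # critical ratio taken as indicating significance

table_1 = [3.49, 1.46, 1.62, 0.00, 2.19, 1.62]
table_2 = [0.67, 1.06, 0.10, 1.54, 0.29, 1.63]
table_3 = [0.87, 0.76, 3.70, 1.52, 0.65, 4.03, 0.00, 2.29, 3.16, 0.54,
           0.44, 4.57, 4.68, 0.76, 7.61, 2.94, 5.33, 4.57, 3.37, 3.16]


def count_supporting(ratios, threshold=THRESHOLD):
    """Number of cases whose critical ratio falls below the threshold."""
    return sum(1 for cr in ratios if cr < threshold)


supporting = [count_supporting(t) for t in (table_1, table_2, table_3)]
total_cases = len(table_1) + len(table_2) + len(table_3)
share = sum(supporting) / total_cases
print(supporting, total_cases, f"{100 * share:.1f} per cent")
```

On this reading the three studies contribute five, six and nine supporting cases respectively.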


TABLE III. RUCH-DEGRAFF DATA

N = 240

N_B   N_A    r_B     r_A     r_obt    CR of r_A - r_obt
 2     3     .859    .901    .886           .87
 2     3     .745    .814    .837           .76
 3     5     .886    .928    .862          3.70
 3     5     .837    .895    .864          1.52
 5     7     .862    .897    .886           .65
 5     7     .864    .899    .800          4.03
 2     3     .843    .890    .890*          .00
 2     3     .864    .905    .858*         2.29
 3     5     .890    .931    .882*         3.16
 3     5     .858    .910    .902*          .54
 5     7     .882    .913    .907*          .44
 5     7     .902    .928    .839*         4.57
 2     7     .745    .911    .800          4.68
 2     5     .745    .880    .864           .76
 2     7     .864    .957    .839*         7.61
 2     5     .864    .941    .902*         2.94
 2     7     .859    .955    .886          5.33
 2     5     .859    .938    .862          4.57
 2     7     .843    .949    .907*         3.37
 2     5     .843    .931    .882*         3.16

* Indicates r’s based on scores corrected for chance by the formula,
$\text{Score} = R - \frac{W}{n-1}$.


IV. DISCUSSION 


On the face of the preceding data the original hypothesis is neither
completely corroborated nor refuted; the weight of the evidence favors
it. If one holds to the original hypothesis it becomes necessary to
attempt to account for the observed discrepant cases. Among the 
factors which might account for them are the following— 

(1) The possibility of computational errors in the data examined: 
The opportunities for these were many, particularly in the earlier 
published correlation studies, when methods of checking were not as 
well developed or known as at present. 

(2) Correction for ‘‘chance”’ in scoring the tests: Sometimes this 
was done, sometimes it was not. A paper is in preparation reporting 
experimental data showing that when corrected for chance mean scores 
vary systematically with the number of alternative choices in test 
items. 

(3) The possibility that in reducing the number of alternative 
choices a non-random procedure was followed or at any rate some 
procedure resulting in choices conflicting with the conditions of the 
Spearman-Brown formula: This possibility is further indicated by the 
fact that in twenty-eight of the thirty-two predictions the predicted r 
was greater than the obtained r. Such consistent over-prediction is 
also obtained whenever a test is lengthened by the addition of material 
non-equivalent to that already present. 

(4) Possible deviations from standard procedure in administration 
of tests when this was done by more than one person in different places 
and at different times: The data in Table III are particularly open to 
this suspicion, sampling as they do widely distributed geographical 
areas in which various persons administered the tests. 
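
The over-prediction count cited under factor (3) can be verified from the predicted and obtained r's of the three tables. The sketch below transcribes the (predicted, obtained) pairs as read from the tables; the variable names are illustrative.

```python
# (predicted r, obtained r) pairs from Tables I, II and III, in table order.
pairs = [
    (.808, .598), (.763, .675), (.713, .796), (.776, .775), (.875, .796), (.843, .745),
    (.578, .624), (.735, .680), (.626, .619), (.731, .647), (.695, .680), (.736, .647),
    (.901, .886), (.814, .837), (.928, .862), (.895, .864), (.897, .886), (.899, .800),
    (.890, .890), (.905, .858), (.931, .882), (.910, .902), (.913, .907), (.928, .839),
    (.911, .800), (.880, .864), (.957, .839), (.941, .902), (.955, .886), (.938, .862),
    (.949, .907), (.931, .882),
]

over = sum(1 for pred, obt in pairs if pred > obt)    # formula predicts too high
under = sum(1 for pred, obt in pairs if pred < obt)   # formula predicts too low
ties = sum(1 for pred, obt in pairs if pred == obt)
print(over, under, ties)
```

If the discrepancies were pure sampling error, over- and under-predictions would be expected in roughly equal numbers, which is why so lopsided a split points to a systematic factor.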

In sum, the hope that the original hypothesis might be adequately 
tested from already published data and without resort to further and 
more careful experimental controls has proved to some extent futile. 
Experimental control of the various possible factors listed above is 
necessary. Accordingly, experiments are in process so that adequately 
guarded conclusions concerning the issue may be drawn.


THE ELLIS VISUAL DESIGNS TEST


LOUISE WOOD AND EDYTHE SHULMAN 


Judge Baker Guidance Center, Boston 
INTRODUCTION 


Experience with school difficulties and other types of learning 
problems has shown that visual memory is one of the important fac- 
tors to be considered. Unfortunately, the clinical psychologist has 
been hindered in testing this function because of the scarcity of stand- 
ardized visual memory scales. In an earlier search for such material, 
the Ellis Visual Designs Test was found to be the only graded measure 
of visual memory available. Although the test was published some
years ago,¹ it was never standardized; and the lack of a scoring scheme,
as well as of norms, made it quite unsatisfactory. Consequently,
accumulated data were analyzed in order to make the test more practical
for clinic purposes. The present study is, therefore, an outgrowth
and continuation of what was originally undertaken without a view to 
publication. 

The scale consists of a series of ten geometrical figures, including 
the two Binet-Simon memory figures. Each design is printed on a 
separate card, five inches square. The series in its original order is 
shown in Fig. I, the dimensions having been reduced to one-half size. 


SUBJECTS 


The subjects used for the standardization of this scale were drawn 
from two sources, a child guidance clinic and a suburban school sys- 
tem. For a number of years, the series was administered to children 
studied at the Judge Baker Guidance Center whenever the routine 
psychological examination period allowed. Because of the slowness of 
collecting data in this way and the difficulties encountered in obtaining 
sufficient numbers at each age level, the school group was added. 

Cases referred to the Guidance Center for study come from all
walks of life and in socio-economic status represent a fair cross-section
of the population. Their problems are widely varied and include many
educational and mild behavior problems. The selection of the clinic
group was made at random from the clinic intake, except for the
following restrictions. Early experimentation showed the test to be too
difficult for most seven-to-eight-year olds. The age range of the
subjects was, therefore, limited to those from eight years and six months to
seventeen years and five months, the upper bounds being determined
by the eighteen year clinic age limit. Children with Stanford-Binet
IQ’s of less than 80 were ruled out as were those children known or
suspected of having head injury, epilepsy, encephalitis, psychotic
symptoms, or psychopathic personality. Likewise, no children with
visual, motor, or other physical handicaps likely to interfere with
performance on the test were included.

¹ Bronner, Healy, et al.: Ellis Visual Designs Test, A Manual of Individual Tests
and Testing. Boston: Little, Brown & Co., 1927, p. 172.

FIG. I. The Ellis visual designs.

The school system from which the school group was chosen is in a 
large community with a varied population, including laboring as well 
as middle and upper classes. The children were selected from different 
schools on the basis of age and grade placement. Recent Stanford-Binet
or Kuhlmann-Anderson ratings were available for a large number.
All those grading below IQ 80 were excluded. Where recent test results
could not be obtained, care was taken to avoid the possibility of includ- 
ing those below normal in intelligence by eliminating children who were 
two or more years retarded in grade for their age and those who, by 
their teacher’s report, were suspected to be below normal. In the 
schools, the Ellis Figures were given individually under standard 
conditions apart from the classroom. Table I gives the distribution of 
numbers by age and sex of school and clinic cases. 


TABLE I. DISTRIBUTION OF SUBJECTS

                              Boys                                       Girls
             Total    Clinic cases     School cases     Total    Clinic cases     School cases
Age          cases     N   Per cent     N   Per cent    cases     N   Per cent     N   Per cent
8/6- 9/5       83     21     25.3      62     74.7        60     12     20.0      48     80.0
9/6-10/5       73     24     32.9      49     67.1        66     14     21.2      52     78.8
10/6-11/5     119     65     54.6      54     45.4        71     29     40.8      42     59.2
11/6-12/5     125     63     50.4      62     49.6        93     36     38.7      57     61.3
12/6-13/5      94     54     57.4      40     42.6        71     24     33.8      47     66.2
13/6-14/5     109     66     60.6      43     39.4        84     34     40.5      50     59.5
14/6-15/5     102     66     64.7      36     35.3        92     38     41.3      54     58.7
15/6-16/5     101     60     59.4      41     40.6       104     56     53.8      48     46.2
16/6-17/5     102     67     65.7      35     34.3        97     49     50.5      48     49.5
Total         908    486     53.5     422     46.5       738    292     39.6     446     60.4



































In all, sixteen hundred forty-six subjects were used, nine hundred
eight of them being boys and seven hundred thirty-eight girls. For the
boys, the numbers at each age level above the two youngest approximate
or exceed one hundred. The numbers of girls at each age level
but one are consistently less. They do, however, approximate or
exceed seventy-five at all but the youngest ages. No fewer than sixty
cases were obtained at any age level.

The numbers of school and clinic children were not balanced. The 
school group comprises slightly more than half of the total group of 
subjects, but the proportions at each age level vary. The boys’ 
group contains proportionally fourteen per cent fewer school cases than 
the girls’. In general the numbers of clinic children of both sexes are 
fewer in the lower age ranges, and increase fairly regularly with the 
ages of the groups. The percentages of boys from the clinic vary from
twenty-five per cent at the 8/6 to 9/5 level to sixty-six per cent at the
16/6 to 17/5 level. For the girls, the lowest percentage of clinic cases
was twenty per cent at the 8/6 to 9/5 level and the highest, fifty-four
per cent in the 15/6 to 16/5 group.
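
The proportions discussed in this paragraph can be recomputed from the clinic and school counts of Table I. The sketch below is illustrative; the variable and function names are not from the study.

```python
# (clinic N, school N) at each age level, youngest to oldest, from Table I.
BOYS = [(21, 62), (24, 49), (65, 54), (63, 62), (54, 40),
        (66, 43), (66, 36), (60, 41), (67, 35)]
GIRLS = [(12, 48), (14, 52), (29, 42), (36, 57), (24, 47),
         (34, 50), (38, 54), (56, 48), (49, 48)]


def clinic_percentages(counts):
    """Per cent of clinic cases at each age level."""
    return [100.0 * clinic / (clinic + school) for clinic, school in counts]


def school_share(counts):
    """Per cent of school cases over all age levels combined."""
    clinic = sum(c for c, _ in counts)
    school = sum(s for _, s in counts)
    return 100.0 * school / (clinic + school)


boys_pct = clinic_percentages(BOYS)
girls_pct = clinic_percentages(GIRLS)
overall_school = school_share(BOYS + GIRLS)
print([round(p, 1) for p in boys_pct])
print([round(p, 1) for p in girls_pct])
print(round(school_share(BOYS), 1), round(school_share(GIRLS), 1), round(overall_school, 1))
```

The difference between the two sexes' school shares is the "fourteen per cent" mentioned in the text, and the overall school share is the "slightly more than half".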


PRESENTATION AND SCORING 


The ten memory designs were presented in the order established by
Ellis. The form of his very general directions was likewise followed but
made more standard. The examiner introduces the test with the
words: “I am going to show you some little designs one at a time. You
will see each one for just five seconds and then I shall take it away and
ask you to draw from memory what you have seen. Look at it carefully
so you can make yours just like it.” Paper and pencil are given.
Then the first design is presented. “Here is the first one. Ready!”
The stop watch is started as the first card is laid flat on the desk in
front of the child. It is removed at the expiration of the five seconds
by placing the hand directly over the design to obscure its view as
promptly as possible. “Now draw it just like the picture.” The
remaining designs are presented in order in the same way.

The plan of scoring was developed quite arbitrarily on the pattern 
of the half-point scoring used in the Stanford-Binet for the ten-year 
memory designs. After a good deal of experiment and experience with 
the figures, this system of half credits was adopted as the most satis- 
factory from the standpoint of objectivity and ease and rapidity of 
scoring. In outline, the scoring scheme is as follows: Credit for one 
point is given for each design correctly drawn; a half point credit is 
allowed for each design containing only one error or two errors which 
are symmetrically consistent. Reversing, inverting, or turning the 
design on a ninety degree angle is considered one error. Two or more 
errors score zero. Errors obviously due to poor motor control are
disregarded. The credits gained on all ten designs are totaled for the
final score on the test. Samples of half credits for each design may be
seen in Fig. II.
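
The scoring scheme can be written out as a small Python sketch. It is an illustration only: the function names are hypothetical, the examiner is assumed to have already discounted errors due to poor motor control, and the judgment of whether two errors are "symmetrically consistent" is passed in as a flag rather than computed.

```python
def design_credit(n_errors, symmetric_pair=False):
    """Credit for one design: 1, 1/2 or 0 under the half-point scheme."""
    if n_errors < 0:
        raise ValueError("error count cannot be negative")
    if n_errors == 0:
        return 1.0                       # correctly drawn
    if n_errors == 1:
        return 0.5                       # a reversal, inversion or quarter turn counts as one error
    if n_errors == 2 and symmetric_pair:
        return 0.5                       # two symmetrically consistent errors
    return 0.0                           # two or more errors otherwise


def total_score(designs):
    """Sum of credits over the ten designs; each entry is (n_errors, symmetric_pair)."""
    if len(designs) != 10:
        raise ValueError("the scale has ten designs")
    return sum(design_credit(errors, symmetric) for errors, symmetric in designs)


# Hypothetical record: four designs correct, three with one error,
# one with a symmetric pair of errors, two failed outright.
record = [(0, False)] * 4 + [(1, False)] * 3 + [(2, True), (2, False), (3, False)]
score = total_score(record)
```

Total scores therefore run from 0 to 10 in half-point steps, which is the scale used in the frequency table of total credits.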
FIG. II. Samples of half-point credits.








In order to determine to what extent the discriminative value of the 
test could be increased by a more detailed system of credits, the results 
of two hundred of the school cases were scored in several different ways. 
First, the errors made on each design were charted. On the basis of 
these, various scoring systems were devised. The allotment of three, 
four, or five point credits for each figure was successively tried. In
spite of the increased range and somewhat greater individual differentiation,
the means for each age computed on the basis of each of
these scoring systems showed proportionally no greater age differentia- 
tion than those gained from scoring by the simpler half-point method. 
For this reason there seemed little advantage in adopting a more 
detailed and unwieldy system. 


TREATMENT OF DATA AND RESULTS 


The designs drawn by the sixteen hundred forty-six subjects were 
scored by the method described. Table II gives the frequencies of the 


total credits on all ten designs by age and sex, with the mean and 
standard deviation for each group. 


TABLE II. FREQUENCIES OF TOTAL CREDITS

         8/6 to     9/6 to     10/6 to    11/6 to    12/6 to    13/6 to    14/6 to    15/6 to    16/6 to
          9/5        10/5       11/5       12/5       13/5       14/5       15/5       16/5       17/5
Score    B    G     B    G     B    G     B    G     B    G     B    G     B    G     B    G     B    G
0 0} 0; OF Of} OF 0}, Oo; OF OF OF Of of OF OF Of Oo} OF 0 
1 0} Oo} OF 0} oO] Of a) 0} 0} O| OO} Of} OF} Of oO} Of} Of lO 
1.5 1} 1) of 1} of of of of of of of of of of of of of o 
2 3} 4) O| Oo; of Oo} of of} of ol 0 0} 0} 1) of OF 0 0 
2.5 1} 1) He as oat oO} Of} OF cof cof Of of} Of Of Of Of OF o 
3 2; 1} 4) 3] 3! of 2} 2| af of af of of of of of of io 
3.5 5} 9} 6 6 «S| 68} 68} 6} «Cok Oo} ds a} Cok) 6 fd 6) of 
4 li; 4) 5| 6 7 7 si 2] a at 4 oat lo 68} ht 6} oa lo 
4.5 16} 8| 7} 9 10) 3} 4) 3| of a] 9} of 6B} hU} 68} h6U} 8 2 
5 14) 6 6 6) 14 6 OF 8} 8 2 6 4) of af 4 af al 3 
5.5 7| 8} 10} 12| 11] 10} 15} 13} 10) 7} 4) 5| 7 si 3] 7 (4 (8 
6 9} 5| 13) 9] 14) 11] 21] 12) 10} 9) 8 sg 4} 13) | 13] 8] 12 
6.5 5} 4| 5| 4| 29] 16; 18] 15) 11] 12} 9| 13] 10) 14) 12| 15) 6) 15 
7 2} 5; 6 1] 10] 6} 9} 2| 15] 14) 10; 14] 14) 8] 10) 15) 14) 12 
7.5 2} 1) 2} 5] 10) 7| 12) 14) 11] a4} 18] 11] 17] 14) 17] 7] 18) 17 
8 5} 2) 5| 2) 6 2) 9 11] 10) 6] 14) 15) 19] 4) 9) 19) 9} 15 
8.5 o} oO; 2) 1) 6 O| 7 3) 64) «68) 17; «7 14) 13] 13] 13] 13] 10 
9 o} 1; 1) oO} 3} (Of 68} h6U8} 64} hl} 64 8} og} 8} 10) of 15) 2 
9.5 0} 0} Oo} 0} of oO 3 3 2 OF 38) a} Of} 4 7 6a} of 2 
10 0} of Oo} Of OF} Of Of OF a} Of a} Of OF OF 1) Of B 2 
N             83    60    73    66   119    71   125    93    94    71   109    84   102    92   101   104   102    97
Mean score  4.98  4.93  5.56  5.18  5.98  5.88  6.26  6.49  6.65  6.79  6.96  7.02  7.26  6.93  7.33  7.19  7.57  7.12
S. D.        1.4   1.6   1.5   1.4   1.5   1.2   1.5   1.5   1.4   1.1   1.6   1.2   1.2   1.4   1.4   1.2   1.5   1.2



























































Here it may be seen that, while there are certain irregularities,
somewhat greater for the girls than the boys, the scores fall into fairly
normal age distributions and show a definite age progression. Because
of the circumscribed range of the test and the small scoring units, the
age progression of the means necessarily appears slight. However, the
figures do suggest the tendency to a greater difference between average
scores at the younger ages and relatively less differentiation between
the older age groups. The standard deviations are approximately the
same for the different age groups of both boys and girls and indicate a
moderate degree of variability in scores at each age.

CHART I. Percentile curves.
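
The slowing of the age progression can be checked directly from the mean scores reported in Table II. The sketch below takes the means as read from the table and compares the average gain per year over the four youngest steps with that over the four oldest; the split into "early" and "late" steps is an illustrative choice, not one made in the study.

```python
# Mean total scores by age level, 8/6-9/5 through 16/6-17/5, from Table II.
BOYS_MEANS = [4.98, 5.56, 5.98, 6.26, 6.65, 6.96, 7.26, 7.33, 7.57]
GIRLS_MEANS = [4.93, 5.18, 5.88, 6.49, 6.79, 7.02, 6.93, 7.19, 7.12]


def increments(means):
    """Gain in mean score from each age level to the next."""
    return [round(later - earlier, 2) for earlier, later in zip(means, means[1:])]


def average(values):
    return sum(values) / len(values)


boys_inc = increments(BOYS_MEANS)
girls_inc = increments(GIRLS_MEANS)

# Illustrative split: first four yearly steps versus last four.
boys_early, boys_late = average(boys_inc[:4]), average(boys_inc[4:])
girls_early, girls_late = average(girls_inc[:4]), average(girls_inc[4:])
girls_reversals = sum(1 for step in girls_inc if step < 0)
print(boys_inc, girls_inc)
```

The boys' means rise at every step, while the girls' means dip at two of the older levels, which is consistent with the greater irregularity noted for the girls.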

The medians, twenty-five percentiles and seventy-five percentiles
were computed for each age level for boys and girls separately. The
curves for the boys’ scores based on these figures are quite regular and
required only slight smoothing, while those for the girls show more
unevenness in progression and somewhat greater adjustment was
necessary for smoothing. The details are diagrammatically presented
in Chart I. The percentile scores for the boys show a very even
upward progression from age to age. The curves for the girls, on the
other hand, progress quite regularly through the 12/6 to 13/5 age level,
but tend to flatten at the upper ages.

The smoothed percentile scores are given below for boys and girls
separately.


TABLE III. SMOOTHED PERCENTILE SCORES

                                            Age
         Per-      8/6    9/6    10/6   11/6   12/6   13/6   14/6   15/6   16/6
         centile   to     to     to     to     to     to     to     to     to
                   9/5    10/5   11/5   12/5   13/5   14/5   15/5   16/5   17/5
Boys     75        6.1    6.7    7.3    7.6    8.0    8.2    8.5    8.7    9.0
         50        5.1    5.8    6.3    6.6    7.1    7.4    7.7    7.8    7.9
         25        4.4    4.7    5.2    5.6    5.9    6.2    6.5    6.8    7.0
Girls    75        6.3    6.5    7.0    7.5    8.0    8.1    8.3    8.3    8.3
         50        5.3    5.7    6.3    6.7    [?]    [?]    [?]    7.3    [?]
         25        4.0    4.6    5.2    5.8    6.1    6.3    6.4    6.4    6.6


For purposes of comparison and evaluation of the group, percentile
scores were computed for the school children alone. The figures are
compared by age and sex with the unsmoothed percentile scores of the
total group in Table IV.


TABLE IV. UNSMOOTHED PERCENTILE SCORES FOR SCHOOL GROUP AND TOTAL GROUP

                                 Boys                            Girls
                           Percentile scores               Percentile scores
Group     Age           N     25     50     75          N     25     50     75
School    8/6- 9/5      62    4.3    5.1    6.2         50    4.1    5.1    6.1
Total                   83    4.4    5.1    6.1         60    3.9    5.2    6.3
School    9/6-10/5      49    4.6    5.5    7.0         52    4.3    5.7    6.4
Total                   73    4.7    5.9    6.8         66    4.5    5.5    6.3
School    10/6-11/5     54    5.3    5.9    7.0         44    5.5    6.3    6.9
Total                  119    5.1    6.3    7.3         71    5.4    6.3    6.9
School    11/6-12/5     62    5.9    6.6    7.0        [?]    [?]    6.6    8.1
Total                  125    5.7    6.6    7.7         93    5.7    6.7    7.9
School    12/6-13/5     40    6.1    [?]    7.9         47    6.5    7.1    7.8
Total                   94    5.9    7.0    7.9         71    6.3    [?]    7.8
School    13/6-14/5     43    6.8    7.9    8.5         50    6.7    7.4    8.2
Total                  109    6.1    7.6    8.4         84    6.5    7.3    8.2
School    14/6-15/5     36    7.0    8.2    8.6         54    6.4    7.3    8.5
Total                  102    6.8    [?]    8.4         92    6.2    7.1    8.4
School    15/6-16/5     41    6.9    8.1    9.0         49    6.5    7.8    8.6
Total                  101    6.6    7.7    8.7        104    6.5    7.4    8.4
School    16/6-17/5     35    7.1    7.8    9.0         45    7.0    7.9    8.5
Total                  102    7.0    7.9    9.0         97    6.6    7.5    8.2


CHART II

The correspondence of scores between the school group and the 
total group is very great, particularly considering the comparatively 
small numbers of school children at each age level. It is, therefore, 
clear that the inclusion of the school subjects has not affected the final 
results of the standardization and that there is no appreciable difference 
in performance between the clinic and school children. 

An analysis of performance on the individual designs made it 
apparent that the original order of presentation established by Ellis 
does not correspond to the order of difficulty of the designs. The 
number of whole credits gained on each design was tabulated, as were 
the number of zero scores. Chart II gives the percentage of whole 
credits for each age. The curves show clearly that designs I and II are 
much the easiest, with no real difference in difficulty for our age range. 
Design III seems to be rightly placed in order. Designs IV, V, and VII 
follow III and seem very nearly equal in difficulty. Design VI comes 
next, VIII following after with a slight difference in difficulty at most 
ages. Designs IX and X fall together at the end as the most difficult in 
the scale with little appreciable difference between them. 

When the percentages of zero credits instead of the whole credits 
were charted, the same order of difficulty appeared. While it is 
possible that the difficulty of the individual designs is affected by the 
half-point scoring system adopted, it is significant that the designs 
group similarly in difficulty when scored by the more detailed schemes 
earlier attempted. Likewise, there is no apparent sex difference in the 
order of difficulty of the designs, for when the percentages of whole 
credits were plotted by age for boys and girls separately, the curves for 
the ten designs fell in the same order. 

No adequate study of handedness was made in connection with this 
standardization; but in administering the designs to the school group, 
it was noted which children drew with the left hand. The group
selected on this basis comprises only 4.8 per cent of the eight
hundred sixty-eight school children, twenty-six boys and sixteen girls.
No attempt was made to ascertain how many of those who drew with
the right hand had marked left-handed tendencies. Because of the
possibility of error resulting from this and the very small numbers of
children who drew with the left hand, no definite conclusions can be
made as to the effect of left-handedness on the test scores. However,
an analysis was made of this small group in comparison with the eight
hundred twenty-six children using the right hand. The numbers of
designs reversed, inverted, and turned at a right angle were tabulated.
Designs I, II, VII, and IX, being symmetrical figures, cannot be
reversed. Designs I, II, III, and V cannot be inverted. Only Design
I cannot be drawn at right angles.

Twenty-nine or sixty-nine per cent of our left-handed group and one 
hundred forty-three or only seventeen per cent of our right-handed 
group made one or more of these three types of errors. For both 
groups reversals are the commonest of the three errors, with right angle 
errors only slightly less. Inversions were comparatively rare in the 
right-handed group and were not shown at all by the left-handed 
children. Of the designs that can be reversed, VI was the only one 
which was not so drawn by any child from either group. Designs VI 
and IX were the only figures never drawn at right angles. 
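
The proportions in the handedness comparison follow from the counts given in the text, as the short sketch below shows. The variable names are illustrative.

```python
SCHOOL_TOTAL = 868              # school children tested
LEFT_BOYS, LEFT_GIRLS = 26, 16  # children who drew with the left hand
LEFT_WITH_ERRORS = 29           # left-handed children with a directional error
RIGHT_WITH_ERRORS = 143         # right-handed children with a directional error

left_n = LEFT_BOYS + LEFT_GIRLS
right_n = SCHOOL_TOTAL - left_n

left_share = 100.0 * left_n / SCHOOL_TOTAL
left_error_rate = 100.0 * LEFT_WITH_ERRORS / left_n
right_error_rate = 100.0 * RIGHT_WITH_ERRORS / right_n
print(left_n, right_n, round(left_share, 1),
      round(left_error_rate), round(right_error_rate))
```

The left-handed error rate is thus about four times the right-handed rate, though it rests on only forty-two children.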


DISCUSSION 


This study offers primarily a standard for the administration and 
scoring of the Ellis Visual Designs Test. No attempt has been made to 
alter or revise the original form of the material. Sufficient data are not 
at present available for a thorough evaluation of the scale as a diagnos- 
tic measure although some appraisal of its value in clinical use with 
children can be made. The questions of the test’s reliability and its 
relation to school achievement have been left for future investigation. 

The present results show that the test has an adequate range of 
difficulty. All of the designs but the first two show an age progression, 
some having wider age variation than others. They are not, however, 
well graded. Several designs are too closely alike in difficulty; they 
are not arranged in order of difficulty; and the progressive steps of 
difficulty are rather irregular. The mean test scores at successive ages 
progress upwards quite evenly and consistently, but only by slight 
increments. The percentile ratings indicate that variation in per- 
formance within the various age groups is greater than the average age 
differences. The range of possible scores, however, is too narrow for 
very sharp individual differentiation. Sex differences are slight. 
Tendencies to reversals, inversions, and other directional difficulties are 
revealed in Ellis figure drawings and appear to occur much more fre- 
quently among left-handed than right-handed children. 

While the test has the advantages of compactness, ease and speed 
of administration, and facility of scoring, there is room for a good deal 
of further research and revision. The series would be greatly improved
by a rearrangement of the order of the figures according to difficulty
and the addition of new figures to fill out the scale. Some of the 
designs have, moreover, proved rather unsatisfactory and should be 
altered or be replaced by others. The inclusion of the Binet figures is a 
definite handicap, particularly in clinical use where the Stanford-Binet 
is given routinely. Figure VI in the scale has also proved to be unfor- 
tunate because it is so frequently associated with the letter K. Often 
children, on looking at the design, will remark that it is a K and, there- 
fore, not note its details. 

As it stands, the Ellis Visual Designs Test may best be used to bring 
to light visual disabilities in memory and reproduction. For this 
purpose the norms give a standard of comparison. While the per- 
centile ratings in themselves are only roughly discriminative, they do 
serve to appraise the extreme deviations in performance. In spite of 
the test’s weaknesses, it does, therefore, have a definite clinical value, 
for it is these extremes of marked ability or disability which have 
educational significance.


AN EXPERIMENTAL COMPARISON OF READING THE
ORIGINAL AND DIGEST VERSIONS OF AN ARTICLE 


H. Y. McCLUSKY 


University of Michigan 
THE PROBLEM 


The amount of reading material now available for public consump- 
tion has reached tremendous proportions. The multiplicity of books 
and periodicals is so extensive that the average reader is able to cover 
only a small fraction of the total output of printed matter. In order 
to cope with this plethora of print, an increasing number of magazines 
are being devoted to the publication of summaries or digests of original 
articles appearing in other periodicals. The present array of magazine 
digests already cover not only articles of general interest but also those 
dealing with such special topics as science, health, religion, education, 
and business. Apparently the magazine digest is filling a great need 
in the field of informal adult education, and if one is to judge from 
current developments this trend will increase as the output of printed 
materials increases. 

The trend described in the preceding paragraph raises some interest- 
ing questions. What is the relative effectiveness of reading the original 
and digest version of an article? How much does the individual lose 
by reading the digest and failing to read the original article? It is the 
purpose of this article to report the results of an experiment designed 
to yield some results related to these questions. 


SUBJECTS, MATERIALS AND PROCEDURE 


The members of two classes in introductory educational psychology 
at the University of Michigan were used as subjects. As a part of their
laboratory work in educational psychology they had already taken the 
Otis Self-Administering Test of Mental Ability, Higher Examination, 
and the Nelson Denny Test of Reading Ability for college students. 
The results of these two tests were used to form two equivalent experi- 
mental groups by means of the technique of matched pairs. 

The materials consisted of the original and digest forms of two 
articles. The first article dealt with the general topic of nutrition and 
the mineral balance of soils. The original form of this article contain- 
ing forty-five hundred thirty-two words appeared under the title of 
“Modern Miracle Men” by Dr. Charles Northern in Hearst’s
International-Cosmopolitan for June, 1936. The digest form of the article
containing ten hundred seventy-seven words appeared under the title
of “Health from the Ground Up” in Readers Digest for January, 1937.
The second article dealt with problems of giving money to various 
public causes. The original article containing thirty hundred ninety- 
three words appeared under the title of ‘‘A Donor’s Dilemma” by 
Barcley Acheson in the Survey Graphic for September, 1937. The 
digest form of the article containing eleven hundred ten words appeared 
under the same title in Readers Digest for November, 1937. 

Comprehension was measured by two sets of true-false questions 
covering each article. These two sets of questions, equal in length, 
were designed to separate the influence of reading the original and 
digest forms of the article. For example, one set of questions was 
based on the material that appeared only in the digest, while the other 
set of questions was based on material appearing in the original article 
but omitted in the digest version of the article. Therefore, the person
reading the original article would have made contact with material 
covered by both the original and digest sets of questions, while the 
person reading the digest would have made contact with material 
covered only by one (digest) set of questions. This procedure was 
employed to yield an answer to the question: To what extent will a 
person reading the digest be able to answer questions covering that 
portion of the original article which was omitted from the digest? 

The rotation procedure of experimentation was employed. For 
example, the members of Group A read the original version of the first
article, “Modern Miracle Men,” and the digest form of the second
article, “A Donor’s Dilemma,” while Group B read the digest form of
the first article, “Modern Miracle Men,” and the original form of the
second article, “A Donor’s Dilemma.” In order to induce a proper
mind-set toward the experiment, both groups were required to read a 
preliminary short, narrative passage followed by a set of thirty true- 
false questions before reading the experimental materials. The time 
of reading was determined by the use of a large clock with a conspicuous 
minute and second hand on the wall of the laboratory. The following 
directions were given to both groups immediately preceding the period 
for experimental reading: 


Before you are some mimeographed materials consisting of an article to be 
read and two sets of questions about the article. You will start to read 
when the signal is given. Read at your normal rate and with the attitude you 
use under ordinary conditions. When you have finished reading the
mimeographed article, look at the clock on the wall and record the exact time in
minutes and seconds on the back of one set of questions. After you have 
recorded the time, turn the reading material face down on the table out of 
your way and proceed to answer the questions from memory by placing a true 
or false mark in the left margin opposite each question. When you have 
finished one set of questions, continue with the other set. 


Summary Steps of Procedure for Group A

1. Read the preliminary article in order to practice keeping time with the
laboratory wall clock and to induce a normal reading mind-set towards
materials and questions.

2. Read the original form of the first article, “Modern Miracle Men,”
and answer two sets of questions with fifty questions in each set.

3. Read the digest form of the second article, “A Donor’s Dilemma,”
and answer two sets of questions with forty questions in each set.

Summary Steps of Procedure for Group B

1. Same step as employed by Group A above.

2. Read the original form of the second article, “A Donor’s Dilemma,”
and answer two sets of questions with forty questions in each set.

3. Read the digest form of the first article, “Modern Miracle Men,”
entitled “Health from the Ground Up,” and answer two sets of questions
with fifty questions in each set.


THE DATA FOR THE MATCHED PAIRS 


About eighty college students, juniors and seniors at the University 
of Michigan, were involved in some stage of the experiment. But 
because some students were absent from some of the testing and experi- 
mental periods of the investigation, and because there was difficulty in 
securing suitable experimental partners for some subjects, a total of 
sixty-eight students or thirty-four matched pairs was finally employed 
in the results. The purposes of the investigation did not require, 
and the reliability of the tests as applied to individual results did not 
permit, the writer to form matched pairs having exactly the same 
numerical raw score for both the intelligence and reading tests. But 
an inspection of Table I will disclose that in spite of the difficulty in 
securing equal experimental partners when two variables are employed, 
for practical purposes a high degree of equivalence was secured in this 
investigation for each matched pair and that as a consequence virtually 
exact numerical equivalence was secured for the groups. Compare
averages 62.5 and 61.7 for the intelligence test, and averages 111.2 and
111.7 for the reading test.

TABLE I

Group A                               Group B
Individual  Intelligence  Reading     Individual  Intelligence  Reading
1A          45            94          1B          45            83
2A          60            112         2B          60            114
3A          53            126         3B          56            125
4A          52            106         4B          58            112
5A          63            125         5B          63            124
6A          64            119         6B          65            117
7A          70            154         7B          72            148
8A          57            129         8B          56            140
9A          63            98          9B          63            89
10A         70            129         10B         68            127
11A         64            115         11B         63            131
12A         70            129         12B         74            143
13A         51            74          13B         49            75
14A         59            83          14B         55            87
15A         58            54          15B         58            66
16A         53            90          16B         50            89
17A         61            94          17B         59            95
18A         60            98          18B         61            104
19A         61            100         19B         54            103
20A         65            115         20B         65            116
21A         69            143         21B         68            150
22A         48            83          22B         48            77
23A         69            127         23B         64            128
24A         71            163         24B         73            156
25A         56            128         25B         55            128
26A         61            74          26B         61            70
27A         63            104         27B         60            103
28A         72            126         28B         72            125
29A         71            113         29B         70            110
30A         68            95          30B         64            96
31A         71            126         31B         72            121
32A         71            117         32B         68            123
33A         69            106         33B         64            102
34A         68            133         34B         67            128

Average     62.5          111.2       Average     61.7          111.7


THE RESULTS OF THE INVESTIGATION


Since the first article, entitled “Modern Miracle Men,” was longer
than the second article, “A Donor’s Dilemma,” two sets of questions
with fifty questions in each set were used to measure comprehension of
the first (former) article and forty questions were used in each set to
measure the comprehension of the second (latter) article. In order to
compare the results on both articles, the raw comprehension scores
were converted into percentage, not percentile, scores. For example, a
percentage score of 70 for subject 1A in Table II is the same as a raw
score of 35 in reading the article “Modern Miracle Men,” while for
subject 2B in Table III a percentage score of 75 is a raw score of 30 in
reading the article “A Donor’s Dilemma.” The comprehension
scores in the remaining four tables are, therefore, percentage scores. 
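
The conversion from raw to percentage scores can be illustrated with a minimal sketch in Python. The function name `to_percentage` is an assumption introduced here for illustration only; the two calls reproduce the examples cited for subjects 1A and 2B.

```python
def to_percentage(raw_score, n_questions):
    """Express a raw number-correct score as a percentage of the questions in one set."""
    return 100 * raw_score / n_questions


# Subject 1A, "Modern Miracle Men": raw score of 35 on a fifty-question set.
print(to_percentage(35, 50))  # 70.0
# Subject 2B, "A Donor's Dilemma": raw score of 30 on a forty-question set.
print(to_percentage(30, 40))  # 75.0
```

Because the two articles used sets of different length (fifty and forty questions), the percentage form is what makes scores on the two articles comparable.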

Since saving time is one of the great assumed advantages in reading 
the digest of an article, time scores are given in terms of the total 
amount of time in seconds required to read the article. The results of 
Groups A and B in reading the first article, “Modern Miracle Men,”
are given in Table II.

The averages of Table II indicate that the members of Group B 
reading the digest version of the article answered eighty per cent of the 
digest questions correctly, while Group A reading the original form of 
the article answered seventy-nine per cent of the digest questions 
correctly. A comparison of the individual scores on the digest ques- 
tions for each matched pair (compare columns 4 and 8) will show an 
advantage in fifteen pairs for Group B, an advantage in thirteen pairs 
for Group A and the same score in six pairs. 
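
The pair-by-pair comparison just described can be checked with a short sketch. The two lists are the digest-question percentage scores transcribed from columns 4 and 8 of Table II, pairs 1 to 34 in order; the function name `tally_pairs` is hypothetical and is not part of the original study.

```python
# Digest-question percentage scores transcribed from Table II, pairs 1-34 in order.
# Column 4: Group A (read the original). Column 8: Group B (read the digest).
group_a_digest = [72, 74, 72, 78, 84, 90, 78, 76, 74, 82, 74, 82, 78, 82, 72, 86, 84,
                  68, 70, 88, 82, 90, 88, 86, 80, 74, 76, 80, 82, 88, 82, 72, 70, 82]
group_b_digest = [72, 86, 74, 86, 78, 74, 88, 76, 86, 94, 76, 74, 78, 72, 78, 74, 74,
                  84, 80, 88, 84, 74, 82, 84, 74, 82, 66, 92, 90, 84, 82, 74, 70, 80]


def tally_pairs(first, second):
    """Count the pairs in which the first member is higher, the second is higher, or both tie."""
    first_higher = sum(a > b for a, b in zip(first, second))
    second_higher = sum(a < b for a, b in zip(first, second))
    ties = sum(a == b for a, b in zip(first, second))
    return first_higher, second_higher, ties


a_higher, b_higher, ties = tally_pairs(group_a_digest, group_b_digest)
print(a_higher, b_higher, ties)  # 13 15 6
```

The tally agrees with the counts reported in the text: thirteen pairs favor Group A, fifteen favor Group B, and six are tied.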

On the other hand, Group A which read the original version shows a 
distinct advantage over Group B which read the digest version in 
answering the questions covering the original material which was 
omitted in the digest version. The advantage here is seventy-eight 
per cent as against sixty-four per cent which translated into raw scores 
is a superiority of seven points. Comparing columns 3 and 7, this 
superiority holds for thirty of the thirty-four matched pairs. Of the 
remaining four pairs, three have the same score, and only one (26A-B) 
pair shows an advantage for Group B. However, Group B read the 
digest in 305 seconds (average score), while Group A read the original 
article in 1231 seconds. In other words, Group A spent slightly more 
than four times longer in reading than Group B. The interpretation of 
these facts will be reserved for a later section. 
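
The two summary figures in the preceding paragraph, the seven-point raw-score superiority and the roughly fourfold reading time, follow directly from the averages of Table II. The sketch below is only an arithmetic check; the name `raw_point_gap` is an assumption made for illustration.

```python
def raw_point_gap(pct_high, pct_low, n_questions):
    """Convert a difference between two percentage scores into raw-score points."""
    return (pct_high - pct_low) * n_questions / 100


# Original-material questions, fifty per set: Group A 78 per cent, Group B 64 per cent.
gap_points = raw_point_gap(78, 64, 50)
# Average reading times in seconds: original article (Group A) and digest (Group B).
time_ratio = 1231 / 305

print(gap_points)            # 7.0
print(round(time_ratio, 2))  # 4.04
```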
TABLE II

Group A reading original version        Group B reading digest version
Individual  Time    Comprehension       Individual  Time    Comprehension
            (sec.)  Original  Digest                (sec.)  Original  Digest
(1)         (2)     (3)       (4)       (5)         (6)     (7)       (8)
1A          1178    70        72        1B          375     56        72
2A          942     72        74        2B          250     58        86
3A          1080    74        72        3B          360     44        74
4A          1467    80        78        4B          240     40        86
5A          1425    82        84        5B          480     62        78
6A          1153    78        90        6B          285     76        74
7A          1050    84        78        7B          300     72        88
8A          1140    88        76        8B          300     82        76
9A          1240    76        74        9B          420     76        86
10A         900     72        82        10B         480     36        94
11A         945     82        74        11B         375     64        76
12A         1045    74        82        12B         285     74        74
13A         1740    70        78        13B         195     64        78
14A         1620    86        82        14B         240     72        72
15A         960     76        72        15B         330     68        78
16A         1200    72        86        16B         335     68        74
17A         1320    86        84        17B         405     60        74
18A         1620    72        68        18B         330     66        84
19A         1080    76        70        19B         285     64        80
20A         1200    78        88        20B         300     66        88
21A         1102    82        82        21B         255     70        84
22A         1500    82        90        22B         420     58        74
23A         1260    70        88        23B         270     64        82
24A         780     74        86        24B         210     60        84
25A         1560    84        80        25B         255     58        74
26A         1200    70        74        26B         360     80        82
27A         1380    78        76        27B         240     68        66
28A         1200    70        80        28B         285     70        92
29A         1080    82        82        29B         300     62        90
30A         1620    88        88        30B         300     52        84
31A         1400    80        82        31B         270     74        82
32A         1305    82        72        32B         180     48        74
33A         1200    72        70        33B         225     66        70
34A         960     84        82        34B         240     64        80
Average     1231    78        79        Average     305     64        80

1 The above table is read as follows: Subject 1A in Group A read the original version of the article
“Modern Miracle Men” in 1178 seconds. He answered seventy per cent of the fifty questions
covering material in the original version but not in the digest version, and seventy-two per cent of the
fifty questions covering material only in the digest version. Subject 1B, the matched partner of
subject 1A, read the digest version of the article in 375 seconds, etc.
TABLE III.—A COMPARISON OF THIRTY-FOUR MATCHED PAIRS READING THE
ORIGINAL AND DIGEST VERSION OF THE ARTICLE ENTITLED “A DONOR’S
DILEMMA”¹

Group B reading original version        Group A reading digest version
Individual  Time    Comprehension       Individual  Time    Comprehension
            (sec.)  Original  Digest                (sec.)  Original  Digest
(1)         (2)     (3)       (4)       (5)         (6)     (7)       (8)
1B          800     67        82        1A          345     65        77
2B          745     87        75        2A          195     70        87
3B          960     72        75        3A          360     70        85
4B          900     45        55        4A          375     45        77
5B          870     75        85        5A          315     62        90
6B          800     80        85        6A          285     50        90
7B          900     87        90        7A          315     75        97
8B          1080    95        82        8A          268     50        87
9B          1020    82        82        9A          360     75        90
10B         1320    85        92        10A         360     82        92
11B         825     77        85        11A         188     77        85
12B         720     95        85        12A         300     75        85
13B         630     70        82        13A         480     72        82
14B         1020    72        82        14A         360     55        92
15B         1020    72        82        15A         300     57        77
16B         930     72        90        16A         300     65        90
17B         1290    80        77        17A         375     57        77
18B         900     72        90        18A         360     70        77
19B         900     65        90        19A         300     72        70
20B         1020    85        92        20A         300     67        87
21B         930     77        95        21A         260     60        80
22B         1215    67        85        22A         340     67        85
23B         840     75        80        23A         420     57        87
24B         810     85        80        24A         200     77        87
25B         615     80        80        25A         300     75        80
26B         945     77        82        26A         300     62        80
27B         690     60        65        27A         420     67        90
28B         990     80        87        28A         360     60        80
29B         1005    72        92        29A         420     67        85
30B         840     75        80        30A         600     70        90
31B         915     77        85        31A         420     75        95
32B         480     77        82        32A         620     52        80
33B         690     82        77        33A         300     55        77
34B         810     75        87        34A         240     60        82
Average     895     76        83        Average     342     65        84

1 The above table is read as follows: Subject 1B (the same person as subject 1B in Table II) read
the original version of the article “A Donor’s Dilemma” in 800 seconds. He answered sixty-seven
per cent of the forty questions covering material in the original version but not in the digest version,
and eighty-two per cent of the forty questions covering material only in the digest version. Subject
1A (the same person as subject 1A in Table II, and the matched partner of 1B) read the digest
version of the article in 345 seconds, etc.

What are the results when the positions of Groups A and B are
reversed in reading another article? The data for Group B reading the
original version and Group A reading the digest version of “A Donor’s
Dilemma” are presented in Table III.

The averages of Table III indicate that Group A reading the digest 
version of the article answered eighty-four per cent of the questions on 
the digest correctly, while Group B reading the original form of the 
article answered eighty-three per cent of the digest questions correctly. 
A comparison of the individual scores on the digest questions for each
matched pair (compare columns 4 and 8) shows an advantage in fourteen
pairs for Group A, an advantage in eleven pairs for Group B, and
the same score in nine pairs.

On the other hand, Group B reading the original version shows 
some advantage over Group A reading the digest version in answering 
the questions covering the original material which was omitted in the 
digest version. The advantage here is seventy-six per cent as against 
sixty-five per cent which translated into raw scores is a superiority of 
4.4 points. A comparison of columns 3 and 7 indicates that this 
superiority holds for twenty-eight matched pairs, while of the remain- 
ing six pairs, three have the same score and three show an advantage 
for Group A. However, Group A read the digest version in 342 
seconds (average score), while Group B read the original article in 895 
seconds. In other words, Group B spent 2.6 times longer in reading 
than Group A. 

To summarize briefly: The results of Tables II and III indicate
practically the same trend: there is no advantage in reading either
the original or the digest form of an article in answering questions covering
the digest, while there is a slight advantage in reading the original
article in answering questions on material in the original but omitted in the
digest version. This general conclusion applies to questions answered
immediately after the article is read.

What are the results if the questions are answered again one week 
after the passage was first read? One of the two classes in educational 
psychology employed as subjects in this investigation was available for 
a re-application of the comprehension questions on both articles. The 
results for fifteen of the original thirty-four matched pairs were used. 
They are labelled in Table IV with the same numbers and letters with
which they are identified in Tables II and III. In order to bring out
the difference not only between reading the original and digest versions
of the article, but also between the immediate and
delayed (one week after) answering of the questions, Tables IV and V
were constructed. The results of the groups reading the article
entitled “Modern Miracle Men” are given in Table IV.


TABLE IV.—A COMPARISON OF THE COMPREHENSION SCORES OF FIFTEEN MATCHED
PAIRS ONE WEEK AFTER READING THE ORIGINAL AND DIGEST VERSIONS OF AN
ARTICLE ENTITLED “MODERN MIRACLE MEN”¹

Group A reading original version                    Group B reading digest version
            Comprehension                                       Comprehension
Individual  Original            Digest              Individual  Original            Digest
            Immediate Week      Immediate Week                  Immediate Week      Immediate Week
                      after               after                           after               after
13A         70        58        78        62        13B         64        70        78        72
14A         86        92        82        84        14B         72        66        72        66
15A         76        60        72        72        15B         68        62        78        72
16A         72        74        86        70        16B         68        88        74        66
17A         86        82        84        80        17B         60        60        74        58
18A         72        68        68        68        18B         66        66        84        78
19A         76        76        70        66        19B         64        62        80        70
22A         82        70        90        80        22B         58        72        74        76
26A         70        72        74        76        26B         80        74        82        82
27A         78        84        76        84        27B         68        64        66        62
29A         82        72        82        82        29B         62        72        90        78
30A         88        76        88        82        30B         52        68        84        78
31A         80        82        82        80        31B         74        64        82        70
32A         82        74        72        82        32B         48        52        74        78
34A         84        80        82        74        34B         64        78        80        70

Average     79        74.6      79        76.1      Average     64.5      67.8      78        71.7

1 The above table is read as follows: Subject 13A, who appears in Tables II
and III as 13A, read the original version of “Modern Miracle Men” and immediately
afterwards answered seventy per cent of the fifty questions covering the
material in the original but not in the digest version correctly, while a
week later he answered fifty-eight per cent of the same questions correctly. He
answered seventy-eight per cent of the digest questions immediately after reading,
while he answered sixty-two per cent of the same questions a week later, etc.


An inspection of Table IV indicates very little difference between
an immediate and a one week delayed reply to the questions. The
trend for the immediate reply of the fifteen matched pairs of this table
is practically the same as that shown by the thirty-four matched pairs
of Tables II and III, while the trend for the delayed response is practically
the same, with only a slight loss for Group B in answering the digest questions
(compare 76.1 and 71.7, a percentage difference of 4.4 or a raw score
difference of 2.2), and a slight disadvantage in answering the original
questions (compare 74.6 and 67.8, a percentage difference of 6.8 or a


TABLE V.—A COMPARISON OF THE COMPREHENSION SCORES OF FIFTEEN MATCHED
PAIRS, ONE WEEK AFTER READING THE ORIGINAL AND DIGEST VERSIONS
OF AN ARTICLE ENTITLED “A DONOR’S DILEMMA”¹

Group B reading original version                    Group A reading digest version
            Comprehension                                       Comprehension
Individual  Original            Digest              Individual  Original            Digest
            Immediate Week      Immediate Week                  Immediate Week      Immediate Week
                      after               after                           after               after
13B         70        80        82        82        13A         72        32        82        55
14B         72        77        82        80        14A         55        62        92        82
15B         72        67        82        80        15A         57        57        77        77
16B         72        80        90        87        16A         65        72        90        85
17B         80        72        77        80        17A         57        65        77        82
18B         72        75        90        77        18A         70        80        77        85
19B         65        72        90        85        19A         72        47        70        72
22B         67        70        85        80        22A         67        70        85        75
26B         77        75        82        82        26A         62        62        80        97
27B         60        67        65        70        27A         67        70        90        85
29B         72        72        92        85        29A         67        60        85        82
30B         75        80        80        85        30A         70        67        90        82
31B         77        77        85        85        31A         75        72        95        85
32B         77        72        82        80        32A         52        62        80        72
34B         75        80        87        87        34A         60        77        82        87

Average     72        74.4      83.4      81.6      Average     64.5      63.6      83.4      78.8

1 The above table is read as follows: Subject 13B, who appears in Tables II and
III with the same label (13B) and who is the partner of 13A, read the original
version of “A Donor’s Dilemma” and immediately afterwards answered seventy
per cent of the forty questions covering the material in the original
but not in the digest version correctly, while a week later he answered eighty per
cent of the same questions correctly. He answered eighty-two per cent of the
digest questions immediately after reading, while he answered eighty-two per cent
of the same questions a week later, etc.

raw score difference of 3.4). In the preceding comparison, however, it is
interesting to note that the advantage of Group A over Group B in
answering the original questions was greater immediately after reading
the original version than it was a week later (compare the difference
between 79 (A) and 64.5 (B) for the immediate response and 74.6 (A)
and 67.8 (B) for the response a week later).

The results of Groups A and B answering the questions on the
article entitled “A Donor’s Dilemma” one week later are presented in
Table V. These results practically confirm the data in Table IV.
The average scores for answering digest questions immediately following
the reading of the original and digest versions are the same, 83.4; a
week later there is a very slight advantage for Group B reading the
original version (81.6 minus 78.8 is 2.8, or a raw score difference of 1.1
points). The average scores for answering the original questions
immediately following the reading of the original and digest versions 
show the same trend as indicated by Tables II and III, that is, some 
advantage for the group reading the original version. This advantage 
is increased a week later, but the increase is only slight because it 
represents a gain of only 1.3 points in raw score. To be specific, 
compare the difference of 7.5 (72 minus 64.5) with the difference of 10.8 
(74.4 minus 63.6). This difference of 3.3 is equivalent to a raw point 
score difference of 1.3. 
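
The change in the original-reading group's advantage over the week can be recomputed from the averages of Tables IV and V. This is a sketch for checking the arithmetic only; the function names `advantage` and `change_in_advantage` are assumptions introduced here, and the inputs are the table averages for the original-material questions.

```python
def advantage(pct_original_group, pct_digest_group):
    """Percentage-point lead of the group that read the original article."""
    return pct_original_group - pct_digest_group


def change_in_advantage(immediate, delayed, n_questions):
    """Return (change in percentage points, change in raw points) from immediate to delayed.

    `immediate` and `delayed` are (original-group %, digest-group %) pairs.
    """
    shift_pct = advantage(*delayed) - advantage(*immediate)
    return shift_pct, shift_pct * n_questions / 100


# Table IV, "Modern Miracle Men" (fifty questions per set).
table_iv = change_in_advantage((79, 64.5), (74.6, 67.8), 50)
# Table V, "A Donor's Dilemma" (forty questions per set).
table_v = change_in_advantage((72, 64.5), (74.4, 63.6), 40)

print([round(x, 2) for x in table_iv])  # [-7.7, -3.85]
print([round(x, 2) for x in table_v])   # [3.3, 1.32]
```

The negative figures for Table IV show the advantage shrinking over the week, while the positive figures for Table V match the 3.3 percentage points (about 1.3 raw points) cited above.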


INTERPRETATION OF RESULTS 


The four exhibits of data (Tables II-V) derived from the rotation 
of the experimental groups reading two different articles and from a 
second administration of the comprehension questions after an interval 
of a week, all point to one conclusion; namely, an impressive advantage 
for reading the digest rather than the original version of the two 
articles. The nature and extent of this advantage may be judged from 
the discussion which follows. 

In the first place, did the extra material of the original article
(thirty-four hundred fifty-five extra words in the case of “Modern
Miracle Men” and nineteen hundred eighty-three extra words in the
case of “A Donor’s Dilemma”) give the reader of the original article so
much additional background that he was able to deal more competently
with the questions covering only the digest version? The data give
no support to this view. In the second place, how much does the
person reading the digest lose by not reading the original article? On
the average, out of a possible score of 50, he lost only seven points in
reading one article, “Modern Miracle Men,” and out of a possible
score of 40, only four and four-tenths points in reading the other article,
“A Donor’s Dilemma.” The writer opines that the significant result
here is not that the group reading the original version made an average
score of four and four-tenths and seven points better than the digest
reading group, but that without the advantage of the extra nineteen
hundred eighty-three and thirty-four hundred fifty-five words,
respectively, the digest reading group was able to approach so nearly the
status of the group reading the original article.

These two points receive added emphasis when the factor of time is
considered. For example, in order to secure the advantage of four and
four-tenths and seven points, the groups reading the original article
required two and six-tenths and four times as long, respectively, as the
groups reading the digest. In other words, how much more background
can the digest group secure by using the time saved in reading
the digest to read other articles which would otherwise be missed!

The great advantage of the time saved by the digest-reading group 
may be shown by another approach to the data. The two sets of 
comprehension questions do not overlap but cover different parts of the 
same material. Therefore, the total comprehension score is the sum of 
the scores on the original and digest questions. In Table II, the 
average percentage scores of 78 and 79 for Group A in terms of raw 
scores are 39 and 39.5 respectively. The total comprehension raw 
score for Group A would, therefore, be 39 plus 39.5 or 78.5. That is, 
in order to answer 78.5 questions correctly the members of Group A on 
the average required 1231 seconds to read the original article. This 
means that on the average when Group A read the original article, 
15.7 seconds of reading were required to answer each of the 78.5 ques- 
tions correctly. This figure is secured by dividing 1231 seconds, the 
average time for reading, by 78.5, the average number of questions 
correctly answered. A similar computation for Group B reading the
digest yields 4.2 seconds of reading per question (305 seconds divided
by a total raw score of 72). If we divide 15.7 seconds of reading per
question for Group A by 4.2 seconds of reading per question for Group
B, the result is 3.7. In other words, Group B reading the digest version
of “Modern Miracle Men” had an advantage of 3.7 times over Group A
reading the original article. The same computation for Groups B
and A in Table III with the article “A Donor’s Dilemma” gives the
digest reading group (A) an advantage of 2.5 times over Group B
reading the original article. 
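
The seconds-of-reading-per-correct-answer computation can be written out as a small sketch. The function name `seconds_per_correct_answer` is hypothetical; the inputs are the average times and percentage scores from Tables II and III, and the total raw score is taken, as in the text, to be the sum of the raw scores on the two question sets.

```python
def seconds_per_correct_answer(reading_time_sec, pct_original, pct_digest, n_per_set):
    """Average reading time divided by the total raw score on both question sets."""
    total_correct = (pct_original + pct_digest) * n_per_set / 100
    return reading_time_sec / total_correct


# Table II, "Modern Miracle Men" (fifty questions per set).
original_ii = seconds_per_correct_answer(1231, 78, 79, 50)  # Group A, original
digest_ii = seconds_per_correct_answer(305, 64, 80, 50)     # Group B, digest

# Table III, "A Donor's Dilemma" (forty questions per set).
original_iii = seconds_per_correct_answer(895, 76, 83, 40)  # Group B, original
digest_iii = seconds_per_correct_answer(342, 65, 84, 40)    # Group A, digest

print(round(original_ii, 1), round(digest_ii, 1), round(original_ii / digest_ii, 1))  # 15.7 4.2 3.7
print(round(original_iii / digest_iii, 1))  # 2.5
```

Both ratios agree with the figures of 3.7 and 2.5 reported in the text.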
The technique of this experiment deals with only one phase of the
total process of reading. Style, coherence, atmosphere, and other
important literary qualities may conceivably be lost in the abridgement
of original materials. It is also possible that a different type of content,
different kinds of measures, and a longer interval of time might
yield data somewhat at variance with those reported in this investigation.
Nevertheless, for purposes of ordinary reading where general
information is sought, the writer submits that the outcome of this study 
suggests that the original version of much current periodical literature 
is padded with excessive and uneconomical verbal baggage. It 
appears, therefore, that whatever danger lurks in the possible loss of 
content as the result of the extravagant deletion of some passage from 
an original article may be more than offset by the time saved in reading 
the digest, to cover other supplementary materials. 


SUMMARY 


Thirty-four pairs of college students were matched on the basis of 
an intelligence and reading test, forming two equivalent groups with 
respect to these measures. One group read the original version of one 
article, while the other group read the digest version of the same article. 
The experimental procedure was repeated by reversing the position 
of the two groups in reading the original and digest versions of a differ- 
ent article. The results of both the immediate and delayed responses 
to two supplementary sets of comprehension questions are interpreted
as indicating an impressive advantage for the group reading the digest
version of the article.


THE PREDICTIVE VALUE OF THE MOST VALID ITEMS
OF AN EXAMINATION


CHARLES C. GIBBONS 
The Ohio State University 


INTRODUCTION 


The principal problem investigated in this study is the relationship 
that exists between scores on an examination and scores based on only 
the most valid items of the examination. Anderson¹ found a correlation
of 0.95 between scores on an examination of two hundred twenty-two
items and scores on the eighty-six most valid items of that examination.
Anderson suggested that this high correlation may have
resulted in part from the fact that the coefficient of correlation was 
computed from the data which had been used originally for the selec- 
tion of the most valid items. The present study eliminates this 
difficulty by first selecting the most valid items from the results of the 
administration of the test to one group and then determining the corre- 
lation from the results of the administration of the test to another group. 

A second problem in the study is the extent to which the validity 
and difficulty of items change from one examination to another. 
Andrews and Bird² have found that certain types of objective questions
are quite stable in validity when repeated with different classes of 
subjects in succeeding years. Besides contributing further evidence 
as to how the validity of items changes, this study investigates the 
stability of the difficulty of items. 


PROCEDURE 


The data for the present study resulted from test analyses of the 
Ohio district-state tests in algebra for 1937 and 1938. These tests, 
prepared by Dr. H. E. Benz and the writer, were administered as a part 
of the testing program of the Ohio State Department of Education. 
The tests were used to select the winners in the algebra scholarship 
contest in each of the five districts of the State of Ohio, and the scores 
from the district tests were used to select the State winners. It is 
important to keep in mind in reading this paper that the groups used 
consist of above-average students from the standpoint of algebra 
ability. Both the 1937 and the 1938 tests consisted of the same forty- 
five items designed to test algebra knowledge and skill taught in the 
first year of algebra instruction. The tests were administered with a
time limit of sixty minutes.

A total of four hundred twenty-six test papers were examined in
1937. The papers were arranged in order of total score. For each
third of the distribution, a count was made of the number of papers 
containing the correct response to each item. The total number of 
persons getting an item correct was taken as a measure of the difficulty 
of the item. On the basis of these measures of difficulty, the items 
were arranged in order of difficulty on the 1938 test. 

An index of validity for each item was computed by dividing the 
number of correct responses in the upper third of the total group by the 
number of correct responses in the middle third of the total group. 
Lentz, Hirshstein and Finch³ found that the “upper-lower thirds”
method of selecting valid test items is not only the simplest but also the 
best method. The present modification was adopted because the test 
is used to differentiate only among students at the upper end of the 
distribution. The forty-five items were ranked in order of validity 
and the twenty-six most valid items were selected for further study. 
For every third paper in the distribution, the correlation was computed 
between score on the long test and score on this short test consisting of 
the twenty-six most valid items. For the same papers, the split-half 
reliability coefficients were computed for both the long and the short 
tests. 
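
The two item statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's actual procedure: the function name `difficulty_and_validity`, the nine example papers, the handling of a group size not divisible by three, and the `None` returned when no one in the middle third answers correctly are all hypothetical.

```python
def difficulty_and_validity(papers):
    """Compute the difficulty and validity index of a single item.

    `papers` is a list of (total_score, answered_item_correctly) tuples.
    Difficulty is the total number of correct responses (a higher count means an easier item).
    Validity is correct responses in the upper third divided by those in the middle third.
    """
    ranked = sorted(papers, key=lambda paper: paper[0], reverse=True)
    third = len(ranked) // 3  # assumption: any leftover papers fall in the lowest group
    upper = ranked[:third]
    middle = ranked[third:2 * third]
    upper_correct = sum(correct for _, correct in upper)
    middle_correct = sum(correct for _, correct in middle)
    difficulty = sum(correct for _, correct in ranked)
    # Assumption: the index is left undefined when the middle third has no correct responses.
    validity = upper_correct / middle_correct if middle_correct else None
    return difficulty, validity


# Hypothetical data for one item: nine papers, already listed from highest total score down.
example_papers = [(40, True), (38, True), (35, True),
                  (30, True), (28, False), (27, True),
                  (20, False), (18, True), (15, False)]
print(difficulty_and_validity(example_papers))  # (6, 1.5)
```

An index above 1 means the item was answered correctly more often in the upper third than in the middle third, which is the kind of discrimination the modified method is meant to reward.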

The analysis of the 1937 test indicated that the test was reasonably 
satisfactory. The test copies were not available to anyone outside the 
State Department of Education. For these reasons, and in considera- 
tion of the research possibilities of such a procedure, the items of the 
1937 test were used as the 1938 test. The items were arranged in order 
of difficulty and the wording of one item was changed slightly. In
1938, a total of four hundred thirty-five papers were examined. Again
for every third paper in the distribution, a correlation was computed 
between score on the total test and score on the twenty-six items found 
to be most valid in 1937. The split-half reliabilities of both the long 
test and the short test were again computed. The indices of difficulty 
and validity were computed for the items by the same methods as in 
1937. Difficulty of items in the 1937 test was correlated with difficulty 
of the same items in the 1938 test. The same was done for the validity 
indices of the items. 


RESULTS 


The principal aim of this study was to determine what relationship
exists between scores on an examination and scores based on only the
most valid items of the examination. The correlations in Table I
show this relationship according to thirds of the distribution for both
the 1937 and the 1938 tests.

Table II shows the split-half reliability coefficients of the long and 
short tests after the Spearman-Brown formula has been applied to the 
coefficients. The coefficients are based on every third paper in the 
distribution, one hundred forty-two cases in 1937 and one hundred 
forty-five cases in 1938. Table II also contains estimates of the relia- 
bility coefficient of a test which would contain forty-five items similar 
to the twenty-six most valid items. 
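
The text does not state how the reliability of a forty-five-item test of superior items was estimated. One plausible reading, offered here only as an assumption, is the general Spearman-Brown formula with a lengthening factor of 45/26 applied to the short-test coefficients. The sketch below, with the hypothetical function name `spearman_brown`, shows that this reading reproduces the predicted values in Table II.

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown estimate of reliability when a test is lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


length_factor = 45 / 26  # lengthening the 26-item short test to 45 comparable items

# Short-test coefficients from Table II:
# highest third 1937, highest third 1938, total group 1937, total group 1938.
short_test = [0.54, 0.57, 0.69, 0.82]
predicted = [round(spearman_brown(r, length_factor), 2) for r in short_test]
print(predicted)  # [0.67, 0.7, 0.79, 0.89]
```

With a lengthening factor of 2, the same function gives the usual step-up from a half-test correlation to a full-test split-half reliability.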


TABLE I.—CORRELATIONS BETWEEN SCORES ON TOTAL EXAMINATION AND SCORES
ON MOST VALID ITEMS

            | Highest third | Middle third | Lowest third | Total
            |   of group    |   of group   |   of group   | group
1937 ...... |      .92      |     .51      |     .46      |  .90
1938 ...... |      .83      |     .67      |     .44      |  .90


TABLE II.—RELIABILITY COEFFICIENTS OF LONG AND SHORT TESTS

                                |  Long test   |  Short test  | Predicted reliability
                                |  (45 items)  |  (26 items)  | of test containing 45
                                |              |              | superior items
                                | 1937 | 1938  | 1937 | 1938  | 1937 | 1938
For highest third of group .... | .56  | .55   | .54  | .57   | .67  | .70
For total group ............... | .81  | .86   | .69  | .82   | .79  | .89


It is noteworthy that the coefficients of correlation between total 
test score and score on the most valid items of the test (Table I) are 
higher than the reliability coefficients of the total test (Table II). 
According to this, one should be able to predict a score on the total test 
more accurately by knowing a person’s score on the most valid items 
than he could by knowing the person’s score on a previous administra- 
tion of the long test. This fact suggests that the less valid items impair 
the reliability of a test. Table II shows that the reliability of scores 
based on the most valid items is practically as high as the reliability of 
scores based on the total test which is considerably longer. A com- 
parison of the reliability coefficients for 1937 with those for 1938 
suggests that a test may become more reliable after its items have been 
arranged in order of difficulty. 

A second problem in this study is the extent to which the validity 
and difficulty of items change from one administration of the examina- 
tion to another. Except for one case of rewording, the items in the 
1938 test were the same as in the 1937 test. The items on the 1938 test 
were arranged in order of difficulty. The correlation between difficulty 
of items in 1937 and difficulty of the same items in 1938 was 0.97. 
From this evidence it would seem that the position of an item in the 
test did not determine how many students answered it correctly. 

The correlation between the validity indices of the items in 1937 and 
the validity indices of the same items in 1938 was 0.58. Investigation 
showed, however, that twenty-two of the twenty-six most valid items 
in 1937 were also among the twenty-six most valid items in 1938. It is 
possible that the validity rankings of the items changed because of 
their changed positions in the test. In one case the wording of an item 
was changed. The validity of this item changed from thirty-fourth in 
forty-five to seventh. Paterson, Raskin and Schneidler² have shown 
that the revision of items often results in increasing their validity. 

The correlation between validity and difficulty of items was 0.67 in 
1937 and 0.66 in 1938. As one would expect to be true for the groups 
tested in this study, the more difficult items tended to be more valid. 


SUMMARY AND CONCLUSIONS 


The principal problem investigated in this study is the relation that 
exists between scores on an examination and scores based on only the 
most valid items of the examination. The sources of the data in this 
study were the 1937 and 1938 Ohio district-state algebra tests. The 
twenty-six most valid items on the 1937 test were selected and each 1937 
test paper and each 1938 test paper was given a score on these twenty- 
six items as well as a score on the whole forty-five item test. The 
coefficient of correlation between scores on the whole test and scores on 
the twenty-six most valid items was .90 for both the 1937 and the 1938 
tests. This correlation is higher than the reliability of the whole test. 
From this it would seem that one may be able to predict with consider- 
able accuracy scores on a test from scores on the most valid items of 
the test. 











620 The Journal of Educational Psychology 


Another problem in this study was to determine how the reliability 
of the test and the validity and difficulty of items change from one 
administration of the test to another. A factor that may have influ- 
enced the results was the arrangement of the items on the 1938 test in 
order of difficulty. 

There is some indication that the test became more reliable due to 
the rearrangement of the items in order of difficulty. The reliability 
of the total test rose from 0.81 to 0.86 and the reliability of the short 
test rose from 0.69 to 0.82. 

An index of validity and an index of difficulty were computed for 
each item on the 1937 test. The items were arranged in order of 
difficulty and administered as the 1938 test. The coefficient of correla- 
tion between difficulty of items in 1937 and difficulty of the same items 
in 1938 was 0.97. The items possessed the same relative difficulty 
on the second repetition of the test regardless of the fact that the 
items had been completely rearranged. 

Of the twenty-six most valid items in 1937, twenty-two were among 
the twenty-six most valid items in 1938. For practical purposes, then, 
there was considerable stability in the validity of the items. The 
coefficient of correlation, however, between index of validity in 1937 
and index of validity in 1938 was just 0.58 for all forty-five items of the 
test. There is a possibility that the items changed in relative validity 
because of their new positions in the 1938 test. 
ON THE PERMANENCE OF COLLEGE LEARNING 
H. L. ANSBACHER 


Brown University 


The recent monograph by Robert I. Watson, “An Experimental 
Study of the Permanence of Course Material in Introductory Psy- 
chology,” Archives of Psychology, 1938, #225, pp. 64, is of considerable 
interest. It is, according to the author, the only study “which ade- 
quately measures either recognition or recall for delay intervals of 
more than twelve months” (p. 54), the forgetting curves extending for 
almost five years. The subjects were divided into six subgroups which 
were tested for immediate retention of material learned and retested for 
delayed retention at one of various intervals from two to fifty-eight 
months. Since each group was retested only once, relearning due to 
repeated retesting was precluded. In addition a seventh group, a 
control group, which had never studied psychology but was otherwise 
similar in age and intelligence to the experimental groups, was meas- 
ured once on the same tests as the experimental groups, thus allowing 
for comparison. The study distinguished between recall and recogni- 
tion. The tests used were of the objective type. “Naming, comple- 
tion, listing questions, and the like, were classed as recall questions 
because they required reproduction of previously learned facts” (p. 15). 
True-false and multiple choice questions were classified as recognition 
items, since the observer was required to make a judgment as to facts 
placed before him. 

The present note is concerned with the comparison of that experi- 
mental group which was retested after almost five years (designated 
as group 6) with the control group (designated as group 7) regarding 
their performance on recognition material in particular. 

In testing the difference in performance of these two groups Watson 
obtains from the separate recognition and recall items of the three tests 
employed six separate critical ratios (p. 45, Table 24). Although only 
one of these ratios is above three, the rest are still close enough to three 
to justify the conclusion: ‘‘These data taken in conjunction with the 
fact that the magnitude of the averages always favors the experimental 
group, strongly suggest that the experimental and control groups are 
significantly different and that these averages of group 6 represent 
actual retention measures and not merely scores which might have been 
obtained without the experimental learning situation. In other words, 
complete forgetting had not been reached either in recognition, or 
recall, after a delay of fifty-eight months” (p. 46). This is a moderate 
claim for retention with which, on the basis of the data, the present 
writer agrees. 

It is, however, with reference to a statement on “The Educational 
Significance of the Study” (p. 56) that we should like to take excep- 
tion. At this point Watson states that “about fifty per cent of the 
material learned is recognized correctly after the lapse of nearly five 
years.” By “material learned” is meant those items which were 
correctly answered at the time of the immediate retention tests during 
or immediately after the introductory psychology course. The reader 
who is primarily interested in the educational significance of the study 
and to whom this section is specifically addressed is likely to be left 
with the impression that the fifty per cent retention refers to material 
learned exclusively in the course. The analysis of the data, however, 
shows that such an interpretation would be false. To be sure, Watson 
makes no specific statement to the effect that retention refers to mate- 
rial learned in the course only, but, on the other hand, he also does not 
take the performance of the control group into account here as he did 
in the statement referred to in the preceding paragraph. This should 
have been done in order to preclude any possibility of misinterpreta- 
tion. In Table I the comparison of the experimental and the control 
groups is carried through. The averages are derived from the scores 
presented by Watson in his Tables 20-22 (pp. 40-42). 


TABLE I.—AVERAGES OF PERCENTAGES OF TOTAL MATERIAL CORRECTLY
RECOGNIZED

                            | Group 6, experimental. | Group 7, control. | Scores group 6 minus
                            | Knowledge from outside | Knowledge from    | scores group 7.
                            | plus course learning   | outside learning  | Knowledge from course
                            |                        | only              | learning only
Immediate retention or
  “material learned” ...... |         69.60          |      20.03²       |        49.57
Retention after five
  years’ delay ............ |         33.45          |      20.03²       |        13.42
Ratio of delayed to
  immediate retention ..... |         48.06¹         |                   |        27.07

1 Referred to by Watson as “about fifty per cent” (p. 56). 
2 The one set of scores obtained from Group 7 is, of course, actually neither 
immediate nor delayed retention proper but can for the present purpose be used in 
both places. 

This comparison makes it clear that the statement “about fifty per 
cent of the material learned is recognized correctly after a lapse of 
nearly five years’’ does not make allowance for outside learning and 
that the figure is reduced to about twenty-seven per cent if such 
allowance is made. This means that of the material learned spe- 
cifically in the course only twenty-seven per cent is retained after five 
years. 
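
The arithmetic behind the two retention figures can be retraced in a few lines. This is a sketch in Python using the averages given in Table I; the function name `retention_ratio` and its `baseline` parameter are illustrative labels for the allowance made for outside learning, not terms used by Watson.

```python
# Average percentages of total material correctly recognized (Table I).
immediate_exp = 69.60  # group 6, immediate retention
delayed_exp = 33.45    # group 6, retention after nearly five years
control = 20.03        # group 7, which never took the course

def retention_ratio(delayed, immediate, baseline=0.0):
    """Per cent of the immediately retained material still retained after the
    delay, once a baseline attributed to outside learning is subtracted."""
    return 100 * (delayed - baseline) / (immediate - baseline)

uncorrected = retention_ratio(delayed_exp, immediate_exp)
corrected = retention_ratio(delayed_exp, immediate_exp, baseline=control)

print(f"without allowance for outside learning: {uncorrected:.2f} per cent")
print(f"with allowance for outside learning:    {corrected:.2f} per cent")
```

The first figure corresponds to the "about fifty per cent" of the original statement, the second to the twenty-seven per cent obtained when the control group's score is subtracted from both numerator and denominator.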

Although this further analysis of the data results in a retention 
figure which is only about one-half of the one given by Watson, the 
writer still agrees with Watson in the implication for educators; 
namely, that data obtained in this way on retention of material in a 
course in introductory psychology are probably not a fair measure of 
the total benefit derived by the student. 


CONCLUSION 


Regarding Watson’s statement that ‘‘about fifty per cent of the 
material learned is recognized correctly after the lapse of nearly five 
years’’ it is to be said: 

(1) This holds true if ‘‘material learned” refers to knowledge 
acquired through a course in introductory psychology as well as outside 
such a course. 

(2) If ‘material learned” is taken to refer exclusively to knowledge 
acquired through such a course the figure is reduced to twenty-seven 
per cent. 
NOTE ON THE SCORING OF MATCHING TESTS 


EUGENE SHEN 


Harvard University 


Matching tests are usually scored by crediting each correctly 
matched item and ignoring all omissions and errors. As long as all 
items are attempted by every subject, this method of scoring is per- 
fectly fair, even though the score does not necessarily stand for the 
number of items actually known to the subject, since successful guess- 
ing would be credited as well. But when omissions occur, the more 
cautious subjects are penalized. A formula for correcting the effects 
of guessing will therefore be useful. 

The distribution of scores in matching tests obtainable by sheer 
guessing has been discussed in this Journal by Zubin¹ and by Chapman.² 
If we let t stand for the number of items in the shorter and u in the 
longer series, the mean number of items guessed correctly is t/u, which 
of course reduces to 1 when the two series are equal in length. The 
method of scoring suggested by both of these writers is to subtract this 
mean number from the number of items correct, in each series of match- 
ing, for each and every subject. Obviously, this method would merely 
transfer the scores to a different scale with exactly the same relative 
positions of the subjects. In other words, the amount of guessing is 
assumed to be constant with different subjects. Omissions and errors 
are treated alike. A subject who gets all the items correct will not get 
a perfect score, and one who makes no response and thus cannot have 
guessed will receive a negative score. These consequences are suffi- 
cient to convince the reader of the inadequacy, if not absurdity, of the 
proposed method. 
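
The consequences just listed are easy to verify numerically. The sketch below, in Python, scores three hypothetical subjects by subtracting the chance mean t/u from the number right; the function name and the example subjects are assumptions made for illustration only.

```python
from fractions import Fraction

def constant_correction(right, t, u):
    """Score under the criticized method: number right minus the chance mean t/u."""
    return right - Fraction(t, u)

t, u = 10, 10  # two series of equal length, so the chance mean is 1
rights = {"all correct": 10, "half correct": 5, "no response": 0}
scores = {name: constant_correction(r, t, u) for name, r in rights.items()}

for name, score in scores.items():
    print(f"{name:12s} -> {score}")
```

Every subject loses the same amount, so the ranking is unchanged; the subject with all ten items correct falls short of ten, and the subject who made no response receives a negative score.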

But a correct formula is not difficult to devise. It turns out to be 
extraordinarily simple, exactly the same formula for scoring multiple- 
choice tests; namely, 


\[
S = R - \frac{W}{u - 1}
\]


1 Zubin, J.: “The chance element in matching tests.” Jour. Ed. Psycho., Vol. 
XXIV, 1933, pp. 674-681. 

2 Chapman, D. W.: “The scoring of matching tests with unequal series of 
items.” Jour. Ed. Psycho., Vol. XXVII, 1936, pp. 368-370. 

3 Chapman recommends further dividing the difference by the standard devia- 
tion of chance scores. But this is a minor feature which need not detain us. At 
any rate, the same criticisms will apply. 
where S is the corrected score, R the number of items correct, W the 
number of items wrong, and u the number of alternatives, always in the 
longer series if the two are unequal in length. 

Since evidence of guessing is only to be found in wrong responses, 
it is obvious that a correct scoring formula will necessarily take the 
form R — cW, where c is some constant such that sheer guessing will 
result in a score of zero. Since the mean number of successful guesses 
has already been found to be t/u, the mean number of wrong guesses 
will be t − t/u. If we now substitute t/u for R, t − t/u for W, set 
R − cW to zero, and solve for c, the desired formula is easily obtained. 
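
Carrying out the substitution gives c = (t/u) / (t − t/u) = 1/(u − 1), which is the constant in S = R − W/(u − 1). The sketch below, in Python with exact fractions, checks that sheer guessing on all t items then yields an expected score of zero; the function names are illustrative assumptions, not notation from the note.

```python
from fractions import Fraction

def corrected_score(right, wrong, u):
    """S = R - W/(u - 1); omitted items count neither as right nor as wrong."""
    return right - Fraction(wrong) / (u - 1)

def expected_guess_score(t, u):
    """Corrected score of a subject who attempts all t items by sheer guessing,
    using the mean numbers of right (t/u) and wrong (t - t/u) guesses."""
    mean_right = Fraction(t, u)
    mean_wrong = t - mean_right
    return corrected_score(mean_right, mean_wrong, u)

# The expected score under pure guessing is zero for any t <= u with u > 1.
for t, u in [(5, 5), (8, 12), (10, 15)]:
    print(t, u, expected_guess_score(t, u))

# Unlike the constant-subtraction method, a perfect paper keeps its full
# score and a blank paper scores zero.
print(corrected_score(10, 0, 10), corrected_score(0, 0, 10))
```

Because omissions enter neither R nor W, a cautious subject who leaves doubtful items blank is not penalized relative to one who guesses.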

To be sure, correction for guessing in the case of matching tests is 
subject to the same sort of limitations as in the case of true-false and 
multiple-choice tests. From a purely logical or statistical point of 
view, the exactness of correction is to be determined by the variability 
of success in guessing. Chapman’s formula¹ for the standard deviation 
of a distribution of guesses is very useful in this respect. It shows that 
correction increases in exactness as t diminishes and as u increases. 
The lower limit of t is of course one, which is attained when a matching 
test becomes a multiple-choice item. Guessing is thus a more disturb- 
ing factor in matching than in multiple-choice tests, for equal u values. 
On the other hand, larger u’s are more feasible in matching than in 
multiple-choice tests, and, for the same u, matching tests generally 
take less time and space than multiple-choice tests. The choice 
between matching tests and multiple-choice tests should, therefore, 
rest more on psychological and pedagogical than on logical or statistical 
considerations. 
PUPIL PERFORMANCES ON THE ABBREVIATED AND 
COMPLETE NEW STANFORD-BINET SCALES, 
FORM L 


W. C. KVARACEUS 


Educational Consultant, School Department, Brockton, Mass. 


It is a commonly accepted conclusion on the part of most users of 
the Binet scales that these mental examinations are always subject to 
definite limitations. One of the outstanding limitations of the Binet 
scale, or any other test for that matter, lies in the fact that it measures 
only a subject’s performance or behavior at the moment and under 
certain conditions. From this performance or behavior we infer, with 
some probability of error, the existence of a certain definite mental 
capacity. Inasmuch as mental capacity so measured simply reports a 
certain potentiality for behavior, the question of possible variability 
which might be due to different amounts or kinds of sampling must con- 
stantly be kept in mind when interpreting test results. 

In the construction of the test, the author has the problem of 
adequately sampling the behavior of the examinee in order to insure a 
fair representation of the individual’s behavior as a whole. Obviously 
the reliability of a test may be improved by increasing the number of 
test items, thus insuring a more representative sampling. Since it is 
impractical as well as impossible to measure all behavior, the question 
of the number and kind of test items which will represent adequate 
sampling and, at the same time, yield a sufficiently reliable result, is of 
paramount importance. 

Accepting the choice of items on the Binet scale as representative of 
significant behavior as a whole, we have the alternatives of examining 
with the abbreviated form or the complete scale. Testing with the 
newly revised Stanford-Binet is a time-consuming process. The use 
of the abbreviated scale would result in an appreciable saving in the 
examiner’s time. The question arises: Would the saving in time 
involve too large a sacrifice in the reliability of the test itself? Terman 
and Merrill report that “The probable error of the IQ based on an 
abbreviated test is about twenty per cent higher than for the complete 
scale.”¹ 

The following study reports the extent of the IQ differences of two 
hundred fourteen school children according to their performance on the 





1 Terman, Lewis M. and Merrill, Maud A.: Measuring Intelligence, p. 31. 
complete and abbreviated Form L scale. In all one hundred sixty-two 
boys and fifty-two girls ranging in CA from three to nineteen years 
were tested by the author in the course of the last three years. 

All these subjects had been tested with the complete scale. Utiliz- 
ing the same test results, the performance of each individual was scored 
according to the abbreviated scale. Comparisons of IQ and mental 
age resemblances between the two scorings of each individual and of 
the total group were then possible. 


TABLE I.—DISTRIBUTIONS OF TWO HUNDRED FOURTEEN BOYS AND GIRLS ON THE
STANFORD-BINET, FORM L

                            |  Median   |    Q
Chronological Age ......... | 10.4 yrs. | 2.5 yrs.
Intelligence Quotients
  Abbreviated scale ....... |   89.4    |  11.2
  Complete scale .......... |   90.7    |  12.5
Mental Ages
  Abbreviated scale ....... | 8.4 yrs.  | 2.5 yrs.
  Complete scale .......... | 8.8 yrs.  | 3.9 yrs.


A study of Table I reveals that the group did not represent a 
random sampling. Certain selective factors were operative inasmuch 
as these cases were referred to the author for examination because of 
some academic, social, or personality maladjustment. While the 
members of the group came from all grade levels, they represented a 
low normal sampling. 

When the results of the two scales are compared, an increase in 
mental age and IQ is readily seen when the pupil responses are meas- 
ured on the complete scale. On the average, pupils’ IQ’s and mental 
ages tended to be somewhat lower on the abbreviated scale. 
A difference of 1.3 points in IQ or .43 of a year in mental age was noted 
between the medians obtained from the two scales. Likewise, a greater 
dispersion was found when the complete test was administered. 

The IQ and mental age differences of every individual were also 
investigated. The mean of the IQ differences was found to be 2.85 IQ 
points with a standard deviation of 2.7. In terms of months, the 
mean mental age difference was 3.44 with a standard deviation of 3.07 
months. A glance at the frequency distributions revealed only two 
persons who varied more than ten points and twenty-five who varied 
more than five points. 


TABLE II.—THE EXTENT AND DIRECTION OF IQ AND MA DIFFERENCES OF TWO
HUNDRED FOURTEEN BOYS AND GIRLS ACCORDING TO THEIR PERFORMANCES
ON THE ABBREVIATED AND COMPLETE STANFORD-BINET SCALES, FORM L

                    |  N  | Per cent | Median |  Q
Months
  Gains ........... | 115 |   .54    |   2    |  3
  Losses .......... |  72 |   .34    |   2    |  2
  Total¹ .......... | 214 |          |        |
IQ Points
  Gains ........... | 116 |   .54    |   2    |  2
  Losses .......... |  67 |   .31    |   2    |  2
  Total¹ .......... | 214 |          |        |
IQ’s above 90
  Gains ........... |  53 |   .50    |   3    | 1.5
  Losses .......... |  36 |   .34    |   3    | 1.5
  Total¹ .......... | 106 |          |        |
IQ’s below 91
  Gains ........... |  61 |   .55    |   3    |  2
  Losses .......... |  34 |   .31    |   3    |  2
  Total¹ .......... | 108 |          |        |


1 Besides the gains and losses, the total includes the number who made no 
change. 


Table II reveals that a high percentage of the differences were gain 
differences when total performance was scored. No significant varia- 
tion is to be noted between the higher and lower IQ levels as to the 
percentage who gained and lost on the IQ scale. A slight tendency 
toward greater dispersion on the lower IQ levels of the group was 
evident. 


SUMMARY 


(1) This study reports an investigation of the differences in IQ and 
mental age of two hundred fourteen boys and girls as noted between 
their performance on the abbreviated and complete Stanford-Binet 
scales, Form L. 

(2) The group showed a distinct tendency to gain in IQ and MA 
when tested on the complete scale. A difference of 1.3 points in IQ and 
.43 of a year in MA was observed between the medians obtained on the two scales. 

(3) A more marked dispersion was noted when the pupils were 
scored according to their performance on the total test. 

(4) The mean of the IQ differences between both scales was 2.85 
points with a standard deviation of 2.7 for the group; in terms of mental 
age the mean of the differences was 3.44 with a standard deviation of 
3.07 months. 

(5) At the same time, a slight tendency toward greater dispersion 
was found in the lower IQ group. The same percentage of losses and 
gains was noted among high and lower IQ levels as was found for the 
total group. 
BOOK REVIEWS 


J. P. GUILFORD, Editor. Fields of Psychology. New York: D. Van 
Nostrand Co., Inc., 1940, pp. 695. 


The Fields of Psychology, under the editorship of Dr. J. P. Guilford 
and written by twelve of America’s well-known psychologists, will meet 
for many teachers a “long felt need” for a well-organized text; for 
others it will find its place on a thoughtfully selected reference shelf; 
and to some it will be “just another publication.” 

The twenty-two chapters of this volume follow a sequence which is 
somewhat arbitrary. For those teachers who may wish to reorganize 
the several sections to meet individual adaptations such can be done 
without much loss of continuity. Chapter I (J. P. Guilford) shows the 
place of psychology in relation to the other biological and social 
sciences. Chapters II and III (C. J. Warden) present the viewpoint, 
program, methods and results of animal psychology. Chapters IV and 
V (Mary Shirley) consider the field and products of child psychology. 
Chapters VI, VII, and VIII (Daniel Katz) deal with social psychology, 
crowd behavior and the psychology of nationalism. Chapters IX, X, 
and XI (Laurance F. Shaffer) present abnormal psychology in rela- 
tion to its significance and causes, and discuss the minor and major 
abnormalities. Chapters XII and XIII (Anne Anastasi) treat of 
differential psychology from the standpoint of the nature of individual 
differences and major group differences. Chapter XIV (Horace B. 
English) on educational psychology discusses child nature, intellectual 
and personality development, and the problems of learning in relation to 
the educational process. Chapter XV (C. M. Louttit) deals with the 
problems and methods of clinical psychology. Chapter XVI (Douglas 
Fryer) treats of individual mental efficiency in relation to working 
conditions, motivation, fatigue, etc. Chapters XVII and XVIII 
(Morris S. Viteles) present the many different aspects of vocational 
psychology. Chapter XIX (Douglas Fryer) considers the varied fields 
of professional psychology not included in other chapters of this 
volume. Chapter XX (G. L. Freeman) shows the importance of 
physiological psychology to an understanding of man-as-a-whole. 
Chapter XXI (Kate Hevner) deals with aesthetics as a part of the 
subject-matter of scientific psychology. Chapter XXII (Milton 
Metfessel) closes the volume with the systematic points of view gener- 
ally referred to as the schools of psychology. 
This text, written primarily to meet the needs of a second course in 
psychology, offers the student not only an acquaintance with the 
several fields of psychology, but an appreciation of the degree of 
specialization now current in the science. Although it is evident that 
each author is immersed in his own particular field, the volume is not 
overburdened with technical language nor are individual interests 
overemphasized. Competent editorship has made this text a readable 
volume—something so often not found in collaborated writings. 
Where subject-matter of necessity overlaps, differences of interpreta- 
tion may possibly be the source of confusion to some readers. One 
may, for example, be led to believe that there exists a significant rela- 
tionship between “body types” and “temperaments” when one reads 
p. 271 and be left in doubt about this relationship when he turns to 
p. 579. However, the book is virtually free of direct contradictions. 

Some teachers will feel that too much attention is given to one field 
of psychology at the sacrifice of some other. The editor admits to the 
omission of a chapter on experimental psychology since “All fields are 
becoming more and more experimental, and so the term is coming to 
refer to a method rather than a field.” But does this justify the 
omission when experimental psychology is still a formal course in most 
psychology curricula? It is inevitable, for example, that some may 
deplore the paucity of material on sensory experience and others wish 
for a more unified treatment of psychoanalysis. However, in the 
reviewer’s opinion a conscientious perusal of this book will put the 
elementary psychology student in a good position to appreciate 
the various fields of psychological inquiry. 

B. VON HALLER GILMER. 
Carnegie Institute of Technology. 


JOSEPH TIFFIN, FREDERIC B. KNIGHT, AND CHARLES C. JOSEY. The 
Psychology of Normal People. Boston: D. C. Heath and Co., 
1940, pp. 512. 


“This book is written for students who expect their four years at 
college to prepare them for acceptable service in business, industry, and 
the professions.” In writing the book, use was made of material 
derived from counsel with industrial superintendents, personnel mana- 
gers, hospital physicians, and educational administrators and teachers. 
The authors rightly feel that their own, or any, text can not be 
employed with equal advantage in all first courses. 
The chapters follow somewhat the traditional pattern, including 
such subjects as individual differences, personality, emotions, atten- 
tion, learning, intelligence, perception, imagination and reasoning. 
Very little is said about physiological bases of behavior, sensation, and 
genetic development of the individual. The chapter summaries, which 
are well-done, will be found helpful to the student. Throughout the 
book there is a strong emphasis upon development of personality and 
upon behavior as a psychological whole. The discussions are replete 
with practical illustrations, many of them from abnormal behavior. 
Furthermore, the authors do not adhere closely to an organization of 
materials into units suggested by chapter headings. Thus a given 
discussion frequently is more intimately tied in with other sections. 

Some readers will gain the impression that the text is lacking in 
substance. Much of the discussion is based upon citations from books 
rather than upon data of original contributions. Furthermore, in 
several instances there is uncritical acceptance of a writer’s conclusions 
that are not yet established with a reasonable degree of certainty or are 
even put in a questionable status by other research. Examples of this 
are: (1) Uncritical acceptance of the view that environment can cause 
marked changes in the IQ. No attention is given to negative evidence 
or to fallacies of methodology in the studies from which their conclu- 
sions are drawn. (2) Acceptance of an unsubstantiated view that 
physiological disability is an important determinant of reading 
disability. (3) Stating that four per cent of men are color blind 
when reliable evidence indicates that the figure is seven to eight 
per cent. 

Because of its organization and emphasis, this text might well have 
been called a psychology of adjustment. For the most part, students 
will not only find it interesting reading, but easy to evaluate in terms 
of their own experiences. The book will be found most useful in those 
situations where the aim is to facilitate adjustments in practical situa- 
tions rather than in courses which are designed as fundamental to 
further work in the field of psychology. 

MILES A. TINKER. 
University of Minnesota. 


FLORENCE M. TEAGARDEN. Child Psychology for Professional Workers. 
New York: Prentice-Hall Co., 1940, pp. 641. 


Psychologists are well aware that their colleagues’ interests in 
children have accumulated a sizeable body of knowledge concerning 
child development and behavior. This material has frequently been 
organized into textbooks for use in college classes. Except for several 
attempts to write on child psychology especially for teachers, there 
has been little effort to interpret the psychology of childhood for 
other professional workers. As a result of teaching classes in child 
psychology to social workers, Dr. Teagarden has written a book 
on child psychology with the needs of the case-worker constantly in 
mind. 

She begins with a discussion of heredity especially as it applies to 
the case load. This is followed by consideration of parturition, 
infancy, and the preschool period. Special attention is given to the 
formation of fundamental physical habits, such as eating, sleeping, 
dressing and toilet habits. The child’s place in his own home and 
foster and adoptive homes are especially well discussed. There are 
chapters on children’s emotions, sex life, intelligence, and the child’s 
psychological relation to the school. The last four chapters are 
devoted to problem behavior: delinquency, that incident to children’s 
diseases, that associated with sensory handicaps and motor crippling, 
and speech defects. 

The author has very intelligently integrated her summary of an 
extensive literature. Constantly throughout the book it is evident that 
Teagarden has long worked with social case workers and has an appre- 
ciation of their problems. Each chapter has an extensive bibliography, 
usually including seventy or more references. 

It is the opinion of the reviewer that this book should prove espe- 
cially valuable for social workers and for students in schools of social 
work. Although in one place Teagarden implies that it would not be 
suitable for the usual college student, the reviewer is not inclined to 
agree. C. M. LOUTTIT. 

Indiana University. 


A. J. JONES, E. D. GRIZZELL AND W. J. GRINSTEAD. Principles of Unit 
Construction. New York: McGraw-Hill Book Co., 1939, pp. 232. 


The material presented in this book is concerned with a crucial 
problem in the field of education. To construct a unit of learning is 
neither a simple nor a small task. The material reported here is based 
upon ten years of experience in teaching and experimental work. The 
treatise undoubtedly marks an important and fundamental forward 
step in a field that has been characterized by uncertainty and much 
misdirected effort. 

The authors state that “the unit of learning consists of a group or 
chain of planned, coordinated activities undertaken by the learner in 
order to obtain control over a type of life situation. The unifying 
principle is . . . the learning product to be achieved.” Aspects of 
unit construction considered include psychological basis, essential 
elements, place of the teacher, unit planning, motivation, guidance, and 
measurement of progress and attainment. The appendix lists in some 
detail a number of sample units. 

Although all parts of the discussion are essential and form a well- 
coordinated treatise, some sections are outstanding. These include 
chapters on essential elements of the unit and place of the teacher in 
the unit. The sample units will be found very useful as illustrations 
of points made in the text although in some instances they serve as 
illustrations of things to be avoided. Although not emphasized as 
such by the authors, educational guidance can and should be 
coordinated with the use of units of learning. 

Some will consider that parts of the discussion are too general to be 
of maximum value to the teacher who wishes to develop and employ 
units of learning. Furthermore, it might be emphasized that success- 
ful use of units of learning depends largely upon intelligence of the 
teacher, adequate teacher training, and conviction on the part of the 
classroom teacher that units of learning are superior to certain other 
educational procedures. To what extent will the present group of 
classroom teachers fulfill these requirements? 

This text is clearly written and its materials well-integrated. In 
terms of its probable influence on future educational practice, it should 
be considered as one of the more important educational books of the 
year. MILES A. TINKER. 

University of Minnesota. 


LAWRENCE A. AVERILL. Mental Hygiene for the Classroom Teacher. 
New York: Pitman Publishing Corp., 1939, pp. 217. 


The in-service teacher who has had no previous training in mental 
hygiene will find in this interestingly written book a convenient sum- 
mary of the principles of mental hygiene tailored specifically for the 
elementary-school teacher and his problems. 

The first part of the book is devoted to an analysis of the sources of 
conflict in the teacher and a description of healthy and unhealthy 
techniques of adjustment. Subsequent chapters consider the mental 
hygiene aspects of teacher-student, teacher-colleagues and teacher- 
community relationships. JAMES D. PAGE. 

The University of Rochester. 